Deriving Transition Probabilities and Evolutionary Distances from Substitution Rate Matrix by Probability Reasoning

Xia X

REVIEW ARTICLE | VOLUME 4, ISSUE 1 | OPEN ACCESS DOI: 10.23937/2378-3648/1410031


Xuhua Xia^1,2*

¹Department of Biology, University of Ottawa, Canada

²Ottawa Institute of Systems Biology, Ottawa, Canada

^*Corresponding author: Xuhua Xia, Department of Biology, Ottawa Institute of Systems Biology, University of Ottawa, 30 Marie Curie, P.O. Box 450, Station A, Ottawa, Ontario, K1N 6N5, Canada, Tel: (613)-562-5800, Ext: 6886, Fax: (613)-562-5486.

Accepted: July 24, 2017 | Published: July 27, 2017

Citation: Xia X (2017) Deriving Transition Probabilities and Evolutionary Distances from Substitution Rate Matrix by Probability Reasoning. J Genet Genome Res 3:031. doi.org/10.23937/2378-3648/1410031

Copyright: © 2017 Xia X. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Substitution rate matrices are used to correct multiple hits at the same sites, which requires the derivation of transition probabilities and evolutionary distances from substitution rate matrices. The derivation is essential in molecular phylogenetics and phylogenomics, and represents the only statistically sound way for developing scoring matrices used in sequence alignment and local string matching (e.g., BLAST and FASTA). Three different approaches are frequently used for deriving transition probabilities and evolutionary distances: 1) The probability reasoning, 2) Solving partial differential equations, and 3) Matrix exponential and logarithm. The first approach demands the least amount of mathematical skills but offers the best way for conceptual understanding, and can often generate nice mathematical expressions of transition probabilities and evolutionary distances. This review represents the most systematic and comprehensive numerical illustration of the first approach.

Keywords

Substitution model, Substitution rate, Transition probability, Evolutionary distance

Introduction

Substitutions occur over time and can overwrite each other at the same nucleotide or amino acid site. When we compare two homologous nucleotide sequences and find differences at N sites, the actual number of substitutions (designated by M) could be much greater than N because multiple substitutions could have happened at the same site, overwriting each other. Substitution models are used to infer the unobservable M from the N differences observed in sequence comparisons.

Many substitution models have been proposed for nucleotide, amino acid and codon sequences. All substitution models used in molecular phylogenetics are Markov chain models characterized by 1) Either a transition probability matrix (P) with discrete time or a rate matrix (Q) in continuous time where P can be derived from Q, and 2) Equilibrium frequencies. The general form of an instantaneous rate matrix for nucleotide sequences is, in the order of A, G, C, and T:

$Q = \begin{matrix} A \\ G \\ C \\ T \end{matrix} [\begin{matrix} - & a & b & c \\ g & - & d & e \\ h & i & - & f \\ j & k & l & - \end{matrix}] (1)$

The transition probability matrix, often referred to as the P matrix, specifies the probability of a nucleotide or amino acid changing into another one after time t. It is needed to calculate likelihood and to derive evolutionary distances, and consequently is needed in phylogenetics based on maximum likelihood and distance-based methods as well as Bayesian inference. Whether a substitution model can be implemented for phylogenetic analysis essentially depends on whether the model’s transition probabilities can be calculated.

There are three ways to obtain transition probabilities from the Q matrix [1,2]: 1) By probability reasoning, 2) By solving differential equations involving rates, and 3) By taking the matrix exponential of the rate matrix. The last two require some mathematical background in calculus and linear algebra. The first, in contrast, demands little mathematical skill except for careful book-keeping and solving simultaneous equations. This approach is particularly relevant to biology students, not only for gaining a conceptual understanding of substitution models but also for deriving closed-form expressions for transition probabilities and evolutionary distances. New researchers often ask why we can derive evolutionary distances between two aligned sequences for the TN93 model [3] but cannot for the simpler HKY85 model [4], which is a special case of the TN93 model, whereas F84 (the model used in PHYLIP since 1984), which is also a special case of the TN93 model, can have its evolutionary distance readily derived. One can easily obtain answers to such questions by taking the first approach. However, the first approach is not general-purpose and cannot handle very complicated substitution models. In contrast, the last one can be used with any substitution model specified by a rate matrix from which the matrix exponential can be obtained. In short, all these approaches need to be learned by anyone wishing to become a molecular phylogeneticist, but this paper will focus only on the probability reasoning approach, illustrated with the JC69 [5], K80 [6], F84, HKY85 [4], and TN93 [3] models.

Probability Reasoning to Obtain Transition Probabilities and Evolutionary Distances

Felsenstein [1] presented nice examples of probability reasoning to derive transition probabilities and evolutionary distances from rate matrices. This section presents the approach in a more systematic and accessible way.

JC69 model

Consider nucleotide A in the JC69 model (Figure 1a). Imagine that the nucleotide has a rate α of changing into any of the four nucleotides, i.e., including changing to itself (Figure 1b). This is effectively the same specification as the JC69 model. After time t, the expected number of substitutions is 4αt and the probability of no substitution is p(x = 0, α, t) = e^-4αt according to the Poisson distribution, so the probability of having at least one change is p(x ≥ 1, α, t) = 1 - e^-4αt (Figure 1c). Because nucleotide A can change into any one of the four nucleotides (including nucleotide A itself), each nucleotide gets 1/4 of p(x ≥ 1, α, t). We therefore have, as shown in Figure 1,

$p_{i j} (t) = \frac{p (x \geq 1, α, t)}{4} = \frac{1}{4} - \frac{1}{4} e^{- 4 α t} (2)$

Figure 1: Derivation of transition probabilities and the evolutionary distance (D) based on the JC69 model. The d value in the diagonal of the rate matrix (a) is constrained by the row sum equal to 0, i.e., d = -3α. P(j|i,t) means the probability of changing from the original nucleotide i to nucleotide j after time t, and is synonymous to p_ij(t) or simply p_ij in this paper.

The transition probability p_ii(t) is the sum of two probabilities: the probability of no change (which is e^-4αt) and the probability of changing to itself, which is the same as specified in Eq. (2), as shown in Figure 1e, i.e.,

$p_{i i} (t) = e^{- 4 α t} + \frac{p (x \geq 1, α, t)}{4} = \frac{1}{4} + \frac{3}{4} e^{- 4 α t} (3)$

The transition probability matrix for the JC69 model has only two distinct elements. The four diagonal elements are the same as specified in Eq. (3) and all the off-diagonal elements are the same as specified in Eq. (2). Each row in P adds up to 1 as a nucleotide can either stay the same or change into some other nucleotides.

There are some quick ways to check the derived transition probabilities. First, when t approaches infinity, all entries in matrix P approach ¼ if α > 0, as expected. Second, when t = 0, all diagonal elements in matrix P are 1 and all off-diagonal elements are zero, again as expected. Third, if α is zero, then no change is possible, and we again expect all diagonal elements in matrix P to be 1 and all off-diagonal elements to be zero, which is also true.
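These checks are easy to run numerically. The following is a minimal sketch, assuming illustrative values of α and t that are not taken from the paper; the function name `jc69_transition_matrix` is hypothetical and simply evaluates Eqs. (2) and (3).

```python
import math

def jc69_transition_matrix(alpha, t):
    """Return the 4x4 JC69 transition probability matrix (order A, G, C, T)."""
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay    # diagonal element, Eq. (3)
    p_other = 0.25 - 0.25 * decay   # off-diagonal element, Eq. (2)
    return [[p_same if i == j else p_other for j in range(4)] for i in range(4)]

# Illustrative values only; alpha and t are assumptions, not from the paper.
P = jc69_transition_matrix(alpha=0.1, t=2.0)
row_sums = [sum(row) for row in P]                    # each row should sum to 1

P_zero = jc69_transition_matrix(alpha=0.1, t=0.0)     # identity matrix expected
P_long = jc69_transition_matrix(alpha=0.1, t=1000.0)  # every entry close to 1/4
P_frozen = jc69_transition_matrix(alpha=0.0, t=5.0)   # alpha = 0: identity again
```

Inspecting `row_sums`, `P_zero`, `P_long` and `P_frozen` reproduces the checks described above.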

From p_ij in Eq. (2), the expected proportion of sites that are different between two aligned homologous sequences (p_diff) is 3p_ij(t), i.e.,

$p_{d i f f} = 3 p_{i j} (t) = \frac{3}{4} - \frac{3}{4} e^{- 4 α t} (4)$

Note that p_diff approaches ¾ when t is infinitely large, which means that multiple substitutions can no longer be corrected. Eq. (4) offers another way of deriving p_ii(t) in Eq. (3), i.e., p_ii(t) is simply 1-p_diff.

Eq. (4) allows us to derive the JC69 distance (D_JC69) because a distance is defined as μt, where μ is the substitution rate, which is equal to 3α in the JC69 model. This is analogous to the distance driven being the product of speed (rate) and time. Given that D_JC69 = 3αt, we can derive D_JC69 (Figure 1g) by substituting αt = D_JC69/3 into Eq. (4) and solving for D_JC69, i.e.,

$D_{J C 69} = - \frac{3}{4} \ln (1 - \frac{4 p_{d i f f}}{3}) (5)$

Where p_diff (the expected proportion of sites that differ between the two homologous sequences) can be approximated by the observed proportion of sites (p_diff.obs) differing between the two aligned sequences. Note that p_diff.obs may differ from p_diff even when the underlying substitution process indeed follows JC69 because of 1) Stochastic factors due to limited aligned length of the two sequences, and 2) Distortion caused by suboptimal sequence alignment. Thus, although p_diff in Eq. (4) cannot be greater than 0.75, p_diff.obs could, even when sequences evolve strictly according to the JC69 model. D_JC69 is not defined when p_diff.obs ≥ 0.75 because the logarithm of zero or a negative value is undefined.
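The algebra between Eqs. (4) and (5) is short enough to write out in full. Substituting αt = D_JC69/3 into Eq. (4) and isolating the exponential term gives:

$p_{d i f f} = \frac{3}{4} - \frac{3}{4} e^{- 4 D_{J C 69} / 3} \Rightarrow e^{- 4 D_{J C 69} / 3} = 1 - \frac{4 p_{d i f f}}{3} \Rightarrow D_{J C 69} = - \frac{3}{4} \ln (1 - \frac{4 p_{d i f f}}{3})$

The logarithm requires 1 - 4p_diff/3 > 0, which is where the restriction to proportions below 0.75 comes from.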

We can optionally show that D_JC69 in Eq. (5) is a maximum likelihood distance. For two aligned sequences of length N, designate the number of sites that differ between the two sequences as N_D and the number of sites identical between the two sequences as (N-N_D). The likelihood function is then:

$L = {(\frac{1}{4})}^{N} p_{i i}^{(N - N_{D})} {(1 - p_{i i})}^{N_{D}}$

$\ln L = N \ln (\frac{1}{4}) + (N - N_{D}) \ln (p_{i i}) + N_{D} \ln (1 - p_{i i})$

$= N \ln (\frac{1}{4}) + (N - N_{D}) \ln (\frac{1}{4} + \frac{3}{4} e^{- 4 D_{J C 69} / 3}) + N_{D} \ln (\frac{3}{4} - \frac{3}{4} e^{- 4 D_{J C 69} / 3}) (6)$

Where the constant term N ln(1/4) can be dropped in maximizing lnL to obtain the distance estimate, but needs to be kept when performing a likelihood ratio test for comparing different substitution models (e.g., JC69 against TN93).

We take the derivative of lnL with respect to D_JC69, set the derivative to 0 and solve for D_JC69. The resulting D_JC69 is exactly the same as that in Eq. (5). I used D instead of D_JC69 in the equations below:

$\frac{d \ln L}{d D} = - \frac{(N - N_{D}) e^{- 4 D / 3}}{\frac{1}{4} + \frac{3}{4} e^{- 4 D / 3}} + \frac{N_{D} e^{- 4 D / 3}}{\frac{3}{4} - \frac{3}{4} e^{- 4 D / 3}} = 0$ $D = - \frac{3}{4} \ln (\frac{3 N - 4 N_{D}}{3 N}) = - \frac{3}{4} \ln (1 - \frac{4 p_{d i f f}}{3}) (7)$
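The agreement between the maximum likelihood solution and the closed-form distance can also be checked numerically. The sketch below is an illustration under stated assumptions: the counts N = 24 and N_D = 6 are used as example inputs, the helper names are hypothetical, and a simple ternary search stands in for a proper optimizer (it relies on lnL having a single maximum in D).

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """lnL of Eq. (6) without the constant N*ln(1/4) term."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(1.0 - p_same)

def maximize_on_interval(f, lo, hi, iterations=200):
    """Ternary search for the maximum of a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

n_sites, n_diff = 24, 6  # example counts (assumed inputs for this sketch)
d_numeric = maximize_on_interval(
    lambda d: jc69_log_likelihood(d, n_sites, n_diff), 1e-6, 5.0
)
# Closed-form solution of Eq. (7) for comparison.
d_closed_form = -0.75 * math.log(1.0 - 4.0 * (n_diff / n_sites) / 3.0)
```

The two values should agree to within the numerical tolerance of the search.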

The variance of D_JC69 (designated as V_JC69) is obtained as the negative reciprocal of the second derivative of lnL:

$V_{J C 69} = - \frac{1}{\frac{d^{2} \ln L}{d D_{J C 69}^{2}}} = \frac{p_{d i f f} (1 - p_{d i f f})}{N {(1 - \frac{4 p_{d i f f}}{3})}^{2}} (8)$

Note that V_JC69 decreases with sequence length N, as one would expect. We illustrate the application of Eqs. (5) and (8) by using the aligned sequences in Figure 2, where N = 24, N_D = 6, and p_diff = 6/24 = 0.25. So D_JC69 = 0.3041, and var(D_JC69) = 0.0176.
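The worked example can be reproduced with a few lines of code. This is a minimal sketch with hypothetical function names; `n_sites` denotes the aligned sequence length in Eq. (8), and returning `None` for an undefined distance is a design choice of this sketch rather than anything prescribed by the paper.

```python
import math

def jc69_distance(p_diff):
    """Eq. (5); returns None when the distance is undefined (p_diff >= 0.75)."""
    if p_diff >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)

def jc69_variance(p_diff, n_sites):
    """Eq. (8), with n_sites as the aligned sequence length."""
    return p_diff * (1.0 - p_diff) / (n_sites * (1.0 - 4.0 * p_diff / 3.0) ** 2)

n_sites, n_diff = 24, 6  # counts quoted in the text for Figure 2
p_diff = n_diff / n_sites
d = jc69_distance(p_diff)
v = jc69_variance(p_diff, n_sites)
print(round(d, 4), round(v, 4))  # 0.3041 0.0176
```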

Figure 2: Two homologous sequences for illustrating computation of pairwise evolutionary distances.

The equilibrium frequencies in the π vector can be derived by setting t = ∞ in Eqs. (2) and (3), which leads to p_ii = p_ij = ¼. This implies that the equilibrium frequencies of the four nucleotides are equal for the JC69 model. This is not surprising because the frequencies do not even appear in the rate matrix (Figure 1a).

K80 model

The K80 model has a transition substitution rate α and a transversion rate β (Figure 3a). We will focus on nucleotide A and conceptualize the model with two events (Figure 3b), in contrast to only one event in the JC69 model. The first event (e₁) occurs when nucleotide A changes into any of the four nucleotides (including itself). In other words, the original A is replaced by a nucleotide randomly drawn from a nucleotide pool with equal nucleotide frequencies. This event occurs with a rate β for each of the four possible outcomes. The second event (e₂) occurs when nucleotide A changes either to G or to itself, i.e., the original A is replaced by a nucleotide randomly drawn from a purine pool with equal numbers of A and G. Event e₂ occurs with a rate γ for each of the two possible outcomes. Thus the transition rate α equals β+γ according to this conceptualization. Note that, whenever e₁ happens, the original nucleotide is replaced by any one of the four nucleotides with equal probability, no matter how many e₂ events have occurred before or after the occurrence of e₁. It might help to think of a long sequence with L sites being A at time 0. If these L sites have each experienced at least one e₂ event but no e₁ event, then each site will be either A or G with equal probability (i.e., 0.5), and we expect to have L/2 sites being A and the other L/2 sites being G. In contrast, if each of these L sites has experienced at least one e₁ event, then each site will be A, C, G, or T with equal probability, and we expect to observe A, C, G and T in L/4 sites each. Any e₂ events occurring before or after the e₁ event do not change this expectation. This means that e₁ erases e₂, but not vice versa. The probability that an e₂ event happened is informative only when no e₁ event has happened.

Figure 3: Derivation of transition probabilities and the evolutionary distance (D) based on the K80 model. The rate matrix (a) has the diagonal elements (d) constrained by the row sum equal to 0, i.e., d = -α - 2β. P and Q are the observed proportion of transitional and transversional changes between two aligned homologous sequences. Equating them to their respective expected values, E(P) and E(Q), leads to the solution of αt and βt shown, and the evolutionary distance D. P(j|i,t) means the probability of changing from the original nucleotide i to nucleotide j after time t, and is synonymous to p_ij(t) or simply p_ij in this paper.

After time t, the expected number of substitutions is 2(α+β)t, i.e., the nucleotide A has two ways of change with a rate of α (to A and to G) and another two ways of change with a rate β (to C and to T), so the probability of no change, according to Poisson distribution, is

$p (e_{1} = 0, e_{2} = 0, t) = e^{- 2 (α + β) t} (9)$

Note that α is conceptualized as (β+γ) in Figure 3, so e^-2(α+β)t in Eq. (9) is equivalent to e^-(4β+2γ)t. The probability that at least one e₁ event has occurred is

$p (e_{1} > 0, t) = 1 - e^{- 4 β t} (10)$

Thus, the probability that at least one e₂ has occurred but e₁ has not occurred is simply

$p (e_{2} > 0, e_{1} = 0, t) = 1 - p (e_{1} = 0, e_{2} = 0, t) - p (e_{1} > 0, t)$ $= 1 - e^{- 2 (α + β) t} - (1 - e^{- 4 β t})$ $= e^{- 4 β t} - e^{- 2 (α + β) t} (11)$

These probabilities are also shown in Figure 3c. The condition that “e₁ has not occurred” is needed because an e₁ event can erase an e₂ event (as discussed above).
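The three event classes in Eqs. (9) to (11) can be illustrated with a small simulation. The sketch below assumes that e₁ and e₂ are independent Poisson processes with total rates 4β and 2γ (β per outcome for four outcomes, γ per outcome for two outcomes), which is how the exponents in Eqs. (9) and (10) arise. The rate values, seed, and function name are illustrative assumptions.

```python
import math
import random

def simulate_event_classes(beta, gamma, t, n_sites, seed=1):
    """Estimate P(no event), P(e1 > 0) and P(e2 > 0, e1 = 0) by simulation."""
    rng = random.Random(seed)
    none = e1_any = e2_only = 0
    for _ in range(n_sites):
        # Waiting time to the first e1 (total rate 4*beta) and first e2 (2*gamma).
        e1_happened = rng.expovariate(4.0 * beta) <= t
        e2_happened = rng.expovariate(2.0 * gamma) <= t
        if e1_happened:
            e1_any += 1      # e1 erases any e2, so e2 is irrelevant here
        elif e2_happened:
            e2_only += 1
        else:
            none += 1
    return none / n_sites, e1_any / n_sites, e2_only / n_sites

beta, gamma, t = 0.05, 0.15, 2.0   # illustrative rates, not from the paper
alpha = beta + gamma
sim_none, sim_e1, sim_e2_only = simulate_event_classes(beta, gamma, t, n_sites=200_000)

exact_none = math.exp(-2.0 * (alpha + beta) * t)          # Eq. (9)
exact_e1 = 1.0 - math.exp(-4.0 * beta * t)                # Eq. (10)
exact_e2_only = math.exp(-4.0 * beta * t) - exact_none    # Eq. (11)
```

With a large number of simulated sites, the three simulated frequencies should be close to the exact expressions.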

Now the probability of the starting nucleotide A changing to G during time t, designated as p(G|A,t), is the summation of two probabilities. The first is 1/2 of the probability of p(e₂ > 0, e₁ = 0,t) in Eq. (11) because the other 1/2 is for A to itself. The second is 1/4 of p(e₁ > 0,t) in Eq. (10) because A→A, A→G, A→C and A→T each get ¼, so only 1/4 of p(e₁ > 0,t) is for A→G. The summation of these two probabilities (Figure 3d) is p(G|A,t). This probability is equal to p(A|G,t), p(C|T,t), and p(T|C,t) in the K80 model. In other words, the summation of these two probabilities is the probability of a transition (P_s) during time t. Thus,

$P_{s} = \frac{p (e_{2} > 0, e_{1} = 0, t)}{2} + \frac{p (e_{1} > 0, t)}{4}$ $= \frac{e^{- 4 β t} - e^{- 2 (α + β) t}}{2} + \frac{1 - e^{- 4 β t}}{4}$ $= \frac{1}{4} + \frac{e^{- 4 β t}}{4} - \frac{e^{- 2 (α + β) t}}{2} = P (G | A, t) = P (A | G, t) = P (T | C, t) = P (C | T, t) (12)$

Similarly, the probability of the starting A changing to C (or to T) is 1/4 of p(e₁ > 0,t) in Eq. (10) because 1/4 is for A→A, ¼ is for A→G and 1/4 is for A→T, so only 1/4 is for A→C (Figure 3d). This probability is the probability for a transversional change during time t,

$P_{v} = \frac{p (e_{1} > 0, t)}{4} = \frac{1 - e^{- 4 β t}}{4} (13)$

As a quick check of the derived transition probabilities, we note that P_s and P_v are zero when t = 0 (or when α = 0 and β = 0). This also implies that all diagonal elements in the transition probability matrix are equal to 1, as expected. When t = ∞, with α > 0 and β > 0, all entries in matrix P approach ¼ (the equilibrium frequency of the K80 model), also as expected.
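A further check, not discussed in the text but a general property of Markov chains, is that P(t₁)P(t₂) should equal P(t₁ + t₂). The sketch below builds the K80 matrix from Eqs. (12) and (13) and computes the largest discrepancy; the rates, times, and function names are illustrative assumptions.

```python
import math

def k80_transition_matrix(alpha, beta, t):
    """K80 transition probabilities in the order A, G, C, T (Eqs. 12 and 13)."""
    e_v = math.exp(-4.0 * beta * t)
    e_s = math.exp(-2.0 * (alpha + beta) * t)
    p_s = 0.25 + 0.25 * e_v - 0.5 * e_s   # transition, Eq. (12)
    p_v = 0.25 - 0.25 * e_v               # each transversion, Eq. (13)
    p_same = 1.0 - p_s - 2.0 * p_v        # rows must sum to 1
    partner = {0: 1, 1: 0, 2: 3, 3: 2}    # A<->G and C<->T are transitions
    matrix = []
    for i in range(4):
        row = []
        for j in range(4):
            if i == j:
                row.append(p_same)
            elif partner[i] == j:
                row.append(p_s)
            else:
                row.append(p_v)
        matrix.append(row)
    return matrix

def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

alpha, beta = 0.2, 0.05   # illustrative rates, not from the paper
P0 = k80_transition_matrix(alpha, beta, 0.0)   # identity expected
P1 = k80_transition_matrix(alpha, beta, 1.0)
P2 = k80_transition_matrix(alpha, beta, 2.0)
P3 = k80_transition_matrix(alpha, beta, 3.0)
P12 = matmul(P1, P2)
max_gap = max(abs(P12[i][j] - P3[i][j]) for i in range(4) for j in range(4))
```

A `max_gap` at the level of floating-point rounding indicates that the derived P_s and P_v are mutually consistent.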

For two aligned homologous sequences, P_s can be approximated by the proportion of sites differing by a transition (P), and 2P_v by the proportion of sites differing by a transversion (Q, Figure 3e). Note that the expected Q is equal to 2P_v because there are two ways of having a transversional change. Therefore,

$P = \frac{1}{4} + \frac{e^{- 4 β t}}{4} - \frac{e^{- 2 (α + β) t}}{2} (14)$

$Q = 2 P_{v} = 2 (\frac{1 - e^{- 4 β t}}{4}) = \frac{1 - e^{- 4 β t}}{2} (15)$

We can now first solve for βt from Eq. (15), and then substitute the solution for βt into Eq. (14) to solve for αt. This leads to

$α t = - \frac{\ln (1 - 2 P - Q)}{2} + \frac{\ln (1 - 2 Q)}{4}$ $β t = - \frac{\ln (1 - 2 Q)}{4} (16)$

Recall that evolutionary distance is defined as μt, where μ is the substitution rate which is equal to (α+2β) in the K80 model. Thus, the evolutionary distance based on the K80 model (D_K80) is (α+2β)t, which comes to

$D_{K 80} = α t + 2 β t = - \frac{\ln (1 - 2 P - Q)}{2} - \frac{\ln (1 - 2 Q)}{4} (17)$

Where P and Q can be approximated by the observed proportions of sites differing by a transition or a transversion between two aligned sequences, designated as P_obs and Q_obs. As mentioned with reference to D_JC69, P and Q may differ from P_obs and Q_obs even if the K80 model is followed during sequence evolution. This is because 1) Limited aligned length of the two sequences may result in stochastic variation in P_obs and Q_obs, and 2) The two observed proportions may be distorted by alignment errors (i.e., misidentification of site homology). For example, two homologous sequences that have diverged for an infinite length of time according to the K80 model should have expected P and Q equal to 0.25 and 0.5, respectively. However, we may actually have P_obs > 0.25 or Q_obs > 0.5, which can render D_K80 inapplicable. On the other hand, after sequence alignment and deletion of indels (because evolutionary distances are typically calculated without using sites with indels), P_obs and Q_obs may well be much smaller than the expected 0.25 and 0.5, leading to more severe underestimation of the true distance. It is also possible to have P_obs and Q_obs values that, when used to replace P and Q in Eq. (16), result in negative αt or βt values that make no biological sense. The same applies to D_JC69 (in fact to any evolutionary distance based on a substitution model). Methods for handling such situations are discussed later in the section on the GTR model.

We may optionally show D_K80 in Eq. (17) to be a maximum likelihood estimator of the distance based on the K80 model, just like the D_JC69 distance in Eq. (5). To see this, it is better to re-parameterize the K80 model by replacing αt and βt with D_K80 and κ using the following relationship:

$\begin{array}{l} D_{K 80} = α t + 2 β t (18) \\ κ = α t / β t \end{array}$

Solving these two equations gives us

$\begin{array}{l} α t = \frac{D_{K 80} κ}{κ + 2} (19) \\ β t = \frac{D_{K 80}}{κ + 2} \end{array}$

Substituting αt and βt into Eqs. (14) and (15) makes P and Q functions of D_K80 and κ, and the likelihood function for deriving D_K80 and κ is

$L = {(\frac{1}{4})}^{N} P^{N_{s}} Q^{N_{v}} {(1 - P - Q)}^{N - N_{s} - N_{v}}$ $\ln L = N \ln (\frac{1}{4}) + N_{s} \ln P + N_{v} \ln Q + (N - N_{s} - N_{v}) \ln (1 - P - Q) (20)$

Where the constant term N ln(1/4) can be dropped in maximizing lnL to obtain the distance estimate, but needs to be kept when performing a likelihood ratio test for comparing different substitution models (e.g., K80 against TN93).

Taking partial derivatives with respect to D_K80 and κ, setting them to zero and solving the simultaneous equations, we have

$D_{K 80} = - \frac{\ln (1 - 2 P_{o b s} - Q_{o b s})}{2} - \frac{\ln (1 - 2 Q_{o b s})}{4} (21)$

$κ = \frac{2 \ln (1 - 2 P_{o b s} - Q_{o b s})}{\ln (1 - 2 Q_{o b s})} - 1 (22)$

Where P_obs = N_s/N, and Q_obs = N_v/N. Using the two aligned sequences in Figure 2, we have N = 24 and P_obs = 4/24 and Q_obs = 2/24. These lead to D_K80 = 0.3151, and κ = 4.9126. It may be relevant to add that, while D_JC69 and D_K80 are maximum likelihood estimates, distance formulae for F84 and TN93 models, obtained in the same way by equating the observed substitutions to expected substitutions, are generally not maximum likelihood estimates. This will become clear when we deal with these models.
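The K80 example values follow directly from Eqs. (21) and (22). The sketch below uses a hypothetical function name and returns `None` when either logarithm is undefined or when Q_obs = 0 makes κ indeterminate; that guard is an assumption of this sketch, not a method from the paper.

```python
import math

def k80_distance_and_kappa(p_obs, q_obs):
    """Eqs. (21) and (22); returns None when the estimates are undefined."""
    a = 1.0 - 2.0 * p_obs - q_obs
    b = 1.0 - 2.0 * q_obs
    if a <= 0.0 or b <= 0.0 or b == 1.0:
        return None
    distance = -0.5 * math.log(a) - 0.25 * math.log(b)
    kappa = 2.0 * math.log(a) / math.log(b) - 1.0
    return distance, kappa

n_sites, n_s, n_v = 24, 4, 2   # counts quoted in the text for Figure 2
d_k80, kappa = k80_distance_and_kappa(n_s / n_sites, n_v / n_sites)
print(round(d_k80, 4), round(kappa, 4))  # 0.3151 4.9126
```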

We have previously derived the variance of D_JC69 as the negative reciprocal of the second derivative of lnL with respect to D_JC69. This can be used only when the log-likelihood function is used to estimate a single parameter. When there are multiple parameters (e.g., D_K80 and κ), we cannot use the same approach unless the parameters are not correlated. There are two commonly used methods for deriving variances of parameters. The first is the delta method [7-9], and the second uses the Fisher information matrix to obtain the variances and covariance matrix for the parameters. The delta method, which often yields nice and clean mathematical expressions for the variance, is illustrated in the Appendix. The method using the Fisher information matrix is shown below.

To estimate variances involving multiple parameters such as D_K80 and κ, we first take the second-order partial derivatives of lnL with respect to D_K80 and κ, substitute the estimated D_K80 and κ from Eqs. (21) and (22) into these derivatives, arrange them into what is called a Fisher information matrix (M_FI) below, and compute the matrix inverse of M_FI (designated by M_FI^-1):

$M_{F I} = [\begin{matrix} - \frac{\partial^{2} \ln L}{\partial κ^{2}} & - \frac{\partial^{2} \ln L}{\partial κ \partial D_{K 80}} \\ - \frac{\partial^{2} \ln L}{\partial D_{K 80} \partial κ} & - \frac{\partial^{2} \ln L}{\partial D_{K 80}^{2}} \end{matrix}] (23)$

The diagonal elements of M_FI^-1 are the variances for κ and D_K80, and the off-diagonal elements of M_FI^-1 are covariances. The mathematical expression for the variance of κ is tedious, but that for the variance of D_K80 is simpler:

$V (D_{K 80}) = \frac{a^{2} P + c^{2} Q - {(a P + c Q)}^{2}}{N}, \text{ where}$ $a = \frac{1}{1 - 2 P - Q}, b = \frac{1}{1 - 2 Q}, c = \frac{a + b}{2} (24)$

With the aligned sequences in Figure 2, we have N = 24 and empirical P = 4/24 and Q = 2/24. These lead to D_K80 = 0.3151, and κ = 4.9126. The M_FI and M_FI ^-1 are

$M_{F I} = [\begin{matrix} 0.047435 & - 0.286795 \\ - 0.286795 & 49.641105 \end{matrix}]$ $M_{F I}^{- 1} = [\begin{matrix} 21.84451668 & 0.126204037 \\ 0.126204037 & 0.020873724 \end{matrix}] (25)$

Where the two parameters are in the order of κ and D_K80, i.e., the variance is 21.8445 for κ and 0.0209 for D_K80. The off-diagonal elements are covariances between the two parameters.
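The variance of D_K80 from Eq. (24) and the inverse of the printed Fisher information matrix can be cross-checked in a few lines. This is a minimal sketch with hypothetical helper names; the matrix entries are copied from Eq. (25), so small differences from the printed inverse reflect rounding of those entries.

```python
def k80_distance_variance(p, q, n_sites):
    """Variance of D_K80 from Eq. (24)."""
    a = 1.0 / (1.0 - 2.0 * p - q)
    b = 1.0 / (1.0 - 2.0 * q)
    c = 0.5 * (a + b)
    return (a * a * p + c * c * q - (a * p + c * q) ** 2) / n_sites

def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (m00, m01), (m10, m11) = m
    det = m00 * m11 - m01 * m10
    return [[m11 / det, -m01 / det], [-m10 / det, m00 / det]]

v_eq24 = k80_distance_variance(4 / 24, 2 / 24, 24)

# Fisher information matrix as printed in Eq. (25), parameter order (kappa, D_K80).
m_fi = [[0.047435, -0.286795], [-0.286795, 49.641105]]
m_fi_inv = invert_2x2(m_fi)
v_kappa, v_distance = m_fi_inv[0][0], m_fi_inv[1][1]
```

Here `v_eq24` and `v_distance` should both be close to 0.0209, and `v_kappa` close to 21.84.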

F84 and HKY85 model

The F84 and HKY85 models accommodate not only the differential substitution rates between transitions and transversions, but also different equilibrium nucleotide frequencies, in contrast to JC69 and K80, which assume equal equilibrium nucleotide frequencies. The same probabilistic reasoning used before can be applied to derive transition probabilities for the HKY85 model.

The rate matrix for the F84 model is

$Q_{F 84} = \begin{matrix} A \\ G \\ C \\ T \end{matrix} [\begin{matrix} - & β π_{G} + γ π_{G} / π_{R} & β π_{C} & β π_{T} \\ β π_{A} + γ π_{A} / π_{R} & - & β π_{C} & β π_{T} \\ β π_{A} & β π_{G} & - & β π_{T} + γ π_{T} / π_{Y} \\ β π_{A} & β π_{G} & β π_{C} + γ π_{C} / π_{Y} & - \end{matrix}] (26)$

Where π_A, π_G, π_C and π_T are equilibrium frequencies, π_R and π_Y are frequencies of purines and pyrimidines, and the diagonal elements are constrained by each row summing up to 0. The parameter γ in Eq. (26) is sometimes replaced by κβ, but it is easier to understand the F84 model by using Q_F84 specified in Eq. (26).

We may view the F84 model as featuring two events (e₁ and e₂). Suppose we start with a nucleotide A. Event e₁ occurs with rate β. When it occurs, the original A will be replaced by a nucleotide drawn randomly from a nucleotide pool in which the nucleotide frequencies are the same as the equilibrium frequencies. This means that the original A has a rate of βπ_A, βπ_G, βπ_C and βπ_T to change to A, G, C and T, respectively, through e₁. This is different from the K80 model where, when e₁ occurs, the original A has a probability of 0.25 of changing to each of the four nucleotides. Event e₂ occurs with rate γ, and results in the original A being replaced by a purine drawn randomly from a purine pool with A and G frequencies specified as π_A/π_R and π_G/π_R. Thus, the original A has a rate of γπ_A/π_R and γπ_G/π_R to change to A and G, respectively, through e₂. This again differs from K80 where, when e₂ occurs, the original A has a probability of 0.5 of changing to A or G. These events are illustrated in Figure 4a, where we use x to represent β+γ/π_R. Note that it is not a good idea to use α to represent β+γ/π_R for two reasons. First, if we had started with a nucleotide C or T instead of A, then we would have β+γ/π_Y instead of β+γ/π_R, which would force us to use α₁ and α₂ to distinguish between the two. A casual reader would then be misled into thinking that F84 has three rate parameters (i.e., β, α₁ and α₂) without knowing that α₁ and α₂ are different functions of the same rate parameter γ. Second, I have reserved α to represent β+γ in Figure 4b, which simplifies the derivation of transition probabilities illustrated in Figure 4.

Figure 4: Derivation of transition probabilities based on the F84 model. π_A, π_G, π_C, π_T, π_R and π_Y are equilibrium frequencies of A, G, C, T, purine (A + G) and pyrimidine (C + T), respectively. Event e₁ occurs at a rate of β and leads to the original A being replaced by any of the four nucleotides according to their equilibrium frequencies, and event e₂ occurs at a rate of γ and results in the original A being replaced by either A or G according to their frequencies in the purine pool, i.e., π_A/π_R and π_G/π_R. I used x as a shorthand for β+γ/π_R. The rate γ does not appear in the final transition probabilities because it has been absorbed into α, which equals β+γ as shown in (b). P(j|i,t) means the probability of changing from the original nucleotide i to nucleotide j after time t, and is synonymous with p_ij(t) or simply p_ij in this paper.

Note that whenever event e₁ happens, the original A is replaced by A, G, C and T with probabilities π_A, π_G, π_C and π_T, no matter how many e₂ events have occurred before or after the occurrence of e₁. This is similar to the scenario involving the K80 model, except that the K80 model assumes equal nucleotide frequencies. It might help to think of a long sequence with L sites being A at time 0. If these L sites have each experienced at least one e₂ event but no e₁ event, then each site will be either A or G with probabilities π_A/π_R and π_G/π_R, respectively, and we expect to have (π_A/π_R)L sites being A and (π_G/π_R)L sites being G. In contrast, if each of these L sites has experienced at least one e₁ event, then each site will be A, G, C, or T with probabilities π_A, π_G, π_C and π_T, and we expect to observe A, G, C and T in π_AL, π_GL, π_CL, and π_TL sites, respectively. Any number of e₂ events occurring before or after the e₁ event does not change this expectation. This means that e₁ erases e₂, but not vice versa. The occurrence of an e₂ event is informative only when no e₁ event has happened.

The total rate of flow from the original A to the four nucleotides (including itself, Figure 4a and Figure 4b) is

$π_{A} x + π_{G} x + β π_{C} + β π_{T} = π_{R} x + π_{Y} β = π_{R} (β + γ / π_{R}) + π_{Y} β = β + γ = α (27)$

So the probability that no substitution has happened during time t (Figure 4c), according to Poisson distribution, is

$p (e_{1} = 0, e_{2} = 0, t) = e^{- α t} (28)$

The rate of A changing to A, G, C, and T through e₁ is βπ_A + βπ_G + βπ_C + βπ_T = β, so the probability that at least one e₁ has occurred during time t is

$p (e_{1} > 0, t) = 1 - e^{- β t} (29)$

The probability that e₂ has happened but e₁ has not is then

$p (e_{2} > 0, e_{1} = 0, t) = 1 - p (e_{1} = 0, e_{2} = 0, t) - p (e_{1} > 0, t) = e^{- β t} - e^{- α t} (30)$

The condition that “e₁ has not occurred” is needed because an e₁ event can erase an e₂ event. With these, it is easy to derive the transition probability from A to G (Figure 4d) as the sum of 1) A fraction π_G of p(e₁ > 0,t), which is the probability of at least one e₁ event, each of which results in the original A being replaced by A, G, C, and T with probabilities π_A, π_G, π_C and π_T, and 2) A fraction π_G/π_R of p(e₂ > 0,e₁ = 0,t), which is the probability that e₂ events have occurred and have not been erased by e₁. That is,

$p (G | A, t) = p (e_{1} > 0, t) π_{G} + \frac{p (e_{2} > 0, e_{1} = 0, t) π_{G}}{π_{R}} = π_{G} + \frac{π_{G} π_{Y} e^{- β t}}{π_{R}} - \frac{π_{G} e^{- α t}}{π_{R}} (31)$

From now on, p(j|i,t) will be written simply as p_ij, so p(G|A,t) is p_AG. With the same reasoning, we can derive transition probabilities for other A↔G and C↔T substitutions. Note that the two rate parameters in the F84 model (β and γ) have been re-parameterized into α (= β + γ) and β in Eq. (31). The transition probability from the original A to C (a transversion, Figure 4e) is simply

$p_{A C} = π_{C} p (e_{1} > 0, t) = π_{C} (1 - e^{- β t}) (32)$

For other transversions, e.g., p_AT, one just needs to replace π_C by π_T. The complete transition probability matrix for the F84 model is

$P_{F 84} = \begin{matrix} A \\ G \\ C \\ T \end{matrix} [\begin{matrix} π_{A} + π_{A} π_{Y} x_{1} + π_{G} x_{2} & π_{G} + π_{G} π_{Y} x_{1} - π_{G} x_{2} & π_{C} (1 - e^{- β t}) & π_{T} (1 - e^{- β t}) \\ π_{A} + π_{A} π_{Y} x_{1} - π_{A} x_{2} & π_{G} + π_{G} π_{Y} x_{1} + π_{A} x_{2} & π_{C} (1 - e^{- β t}) & π_{T} (1 - e^{- β t}) \\ π_{A} (1 - e^{- β t}) & π_{G} (1 - e^{- β t}) & π_{C} + π_{C} π_{R} x_{3} + π_{T} x_{4} & π_{T} + π_{T} π_{R} x_{3} - π_{T} x_{4} \\ π_{A} (1 - e^{- β t}) & π_{G} (1 - e^{- β t}) & π_{C} + π_{C} π_{R} x_{3} - π_{C} x_{4} & π_{T} + π_{T} π_{R} x_{3} + π_{C} x_{4} \end{matrix}] (33)$

Where

$x_{1} = \frac{e^{- β t}}{π_{R}}, x_{2} = \frac{e^{- α t}}{π_{R}}, x_{3} = \frac{e^{- β t}}{π_{Y}}, x_{4} = \frac{e^{- α t}}{π_{Y}} (34)$

As a quick check of the transition probabilities, we first note that when t = 0 (or when α = 0 and β = 0), then the diagonal elements are 1 and all off-diagonal elements are 0, which is what we expected. Second, when t = ∞ with α > 0 and β > 0, then the transition probabilities will approach the equilibrium frequencies, which is also what we expected.
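These checks, together with the time-reversibility property π_i p_ij = π_j p_ji, can be verified numerically. The sketch below does not transcribe the matrix entry by entry; instead it assembles each p_ij from the three event probabilities in Eqs. (28) to (30), following the e₁/e₂ reasoning. The nucleotide frequencies, rates, and function name are illustrative assumptions.

```python
import math

def f84_transition_matrix(pi, alpha, beta, t):
    """F84 transition probabilities, order A, G, C, T, from the e1/e2 reasoning.

    pi is (pi_A, pi_G, pi_C, pi_T); alpha = beta + gamma as in Eq. (27).
    """
    pi_r, pi_y = pi[0] + pi[1], pi[2] + pi[3]
    group_freq = [pi_r, pi_r, pi_y, pi_y]       # pool frequency for each target
    p_none = math.exp(-alpha * t)               # no event, Eq. (28)
    p_e1 = 1.0 - math.exp(-beta * t)            # at least one e1, Eq. (29)
    p_e2_only = math.exp(-beta * t) - p_none    # e2 but no e1, Eq. (30)
    matrix = []
    for i in range(4):
        row = []
        for j in range(4):
            p = pi[j] * p_e1                    # drawn from the full pool via e1
            if (i < 2) == (j < 2):              # same purine/pyrimidine group
                p += pi[j] / group_freq[j] * p_e2_only
            if i == j:
                p += p_none                     # nothing happened at all
            row.append(p)
        matrix.append(row)
    return matrix

pi = (0.125, 10 / 48, 0.25, 20 / 48)   # illustrative frequencies (A, G, C, T)
P = f84_transition_matrix(pi, alpha=0.6, beta=0.2, t=1.0)
P_zero = f84_transition_matrix(pi, alpha=0.6, beta=0.2, t=0.0)
row_sums = [sum(row) for row in P]
balance_gap = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
                  for i in range(4) for j in range(4))
```

Rows that sum to 1, an identity matrix at t = 0, and a `balance_gap` near zero are all consistent with the derivation.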

To obtain the distance for the F84 model (D_F84), recall that a distance is defined as μt where μ is the average substitution rate, i.e., substitution rates in Eq. (26) weighted by the equilibrium frequencies:

$D_{F 84} = 2 π_{A} π_{G} (β t + γ t / π_{R}) + 2 π_{T} π_{C} (β t + γ t / π_{Y}) + 2 π_{Y} π_{R} β t (35)$

Now we need to obtain βt and γt in order to calculate D_F84. We can obtain αt and βt, and then obtain γt = αt - βt, remembering that α = β + γ [Figure 4b and Eq. (27)]. The method we will use is the same as that for the K80 model, i.e., we obtain the expected transitions and transversions, designated E(S) and E(V), respectively, from transition probabilities and equate them to the observed S and V to solve for αt and βt. With the property of time reversibility (e.g., π_A•p_AG = π_G•p_GA), we have

$E (S) = 2 π_{A} p_{A G} + 2 π_{C} p_{C T}$ $E (V) = 2 π_{A} p_{A T} + 2 π_{A} p_{A C} + 2 π_{G} p_{G C} + 2 π_{G} p_{G T} (36)$

Equating E(S) and E(V) to the observed S and V, and solving these two equations with the two unknowns (αt and βt), we have

$α t = \ln (\frac{- 2 (π_{A} π_{G} π_{R} π_{Y}^{2} + π_{C} π_{T} π_{R}^{2} π_{Y})}{S π_{R}^{2} π_{Y}^{2} - 2 π_{A} π_{G} π_{R} π_{Y}^{2} - 2 π_{C} π_{T} π_{R}^{2} π_{Y} + (π_{A} π_{G} π_{Y}^{2} + π_{C} π_{T} π_{R}^{2}) V}) (37)$

$β t = - \ln (1 - \frac{V}{2 π_{R} π_{Y}}) (38)$
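The step from E(V) to βt is short enough to spell out. As a sketch, using only the fact that every transversion entry in Eq. (33) has the form π_j(1 - e^{-βt}):

$E (V) = 2 (π_{A} + π_{G}) (π_{C} + π_{T}) (1 - e^{- β t}) = 2 π_{R} π_{Y} (1 - e^{- β t})$

$V = 2 π_{R} π_{Y} (1 - e^{- β t}) \Rightarrow e^{- β t} = 1 - \frac{V}{2 π_{R} π_{Y}}$

Taking the negative logarithm gives Eq. (38). Because E(V) does not involve α, βt can be solved first and then substituted into E(S) = S to obtain αt.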

Substitute βt and γt (= αt - βt) into Eq. (35) and, after some algebraic manipulation, we have a more useful form of D_F84:

$D_{F 84} = \frac{2}{π_{R} π_{Y}} [- (π_{A} π_{G} + π_{C} π_{T}) π_{R} π_{Y} \ln (x_{1}) + π_{C} π_{T} π_{R} \ln (\frac{x_{2}}{x_{3}}) + π_{A} π_{G} π_{Y} \ln (\frac{x_{2}}{x_{3}}) - π_{R}^{2} π_{Y}^{2} \ln (x_{1})] (39)$

Where

$x_{1} = 1 - \frac{V}{2 π_{R} π_{Y}}$

$x_{2} = (π_{A} π_{G} π_{Y} + π_{C} π_{T} π_{R}) (2 π_{R} π_{Y} - V)$

$x_{3} = - S π_{R}^{2} π_{Y}^{2} + 2 π_{A} π_{G} π_{R} π_{Y}^{2} + 2 π_{C} π_{T} π_{Y} π_{R}^{2} - π_{A} π_{G} π_{Y}^{2} V - π_{C} π_{T} π_{R}^{2} V (40)$

To illustrate the calculation of D_F84, we may use the two aligned sequences in Figure 2, which give us π_A = 6/48, π_C = 12/48, π_G = 10/48, π_T = 20/48, S = 4/24, V = 2/24, αt = 0.5778363341, βt = 0.2076393648, γt = αt - βt = 0.3701969693, and D_F84 = 0.3198867427. The variance of D_F84 can be obtained by either the delta method or the method using the Fisher information matrix.
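The numbers above can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `f84_distance` and its argument order are illustrative, and the inputs are the frequencies and the observed S and V quoted for Figure 2.

```python
import math

def f84_distance(pi_A, pi_G, pi_C, pi_T, S, V):
    """Return (alpha_t, beta_t, D_F84) using Eqs. (37), (38) and (35)."""
    pi_R, pi_Y = pi_A + pi_G, pi_C + pi_T
    a, c = pi_A * pi_G, pi_C * pi_T
    # Eq. (38): beta*t from the observed transversions only
    beta_t = -math.log(1.0 - V / (2.0 * pi_R * pi_Y))
    # Eq. (37), with numerator and denominator both multiplied by -1
    num = 2.0 * (a * pi_R * pi_Y**2 + c * pi_R**2 * pi_Y)
    den = num - S * pi_R**2 * pi_Y**2 - (a * pi_Y**2 + c * pi_R**2) * V
    alpha_t = math.log(num / den)
    gamma_t = alpha_t - beta_t
    # Eq. (35): average substitution rate multiplied by t
    D = (2.0 * a * (beta_t + gamma_t / pi_R)
         + 2.0 * c * (beta_t + gamma_t / pi_Y)
         + 2.0 * pi_R * pi_Y * beta_t)
    return alpha_t, beta_t, D

alpha_t, beta_t, D_F84 = f84_distance(6/48, 10/48, 12/48, 20/48, S=4/24, V=2/24)
print(alpha_t, beta_t, alpha_t - beta_t, D_F84)
```

For highly diverged sequences the argument of either logarithm can become zero or negative, in which case `math.log` raises `ValueError`; the sketch does not try to handle that case.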

A substitution model similar to the F84 model is the HKY85 model, with its rate matrix specified as:

$Q_{H K Y 85} = \begin{matrix} A \\ G \\ C \\ T \end{matrix} [\begin{matrix} - & (β + γ) π_{G} & β π_{C} & β π_{T} \\ (β + γ) π_{A} & - & β π_{C} & β π_{T} \\ β π_{A} & β π_{G} & - & (β + γ) π_{T} \\ β π_{A} & β π_{G} & (β + γ) π_{C} & - \end{matrix}] (41)$

Where (β+γ) is often written as α and the diagonal elements are constrained by each row summing up to 0. The HKY85 model and the F84 model differ only in the specification of rates involving transitions. q_AG and q_CT are π_G(β+γ) and π_T(β+γ) in the HKY85 model specified in Eq. (41), in contrast to π_G(β+γ/π_R) and π_T(β+γ/π_Y), respectively, in the F84 model specified in Eq. (26). By comparing these rates, it becomes obvious that the F84 model would be equivalent to the HKY85 model if π_R = π_Y.

We can obtain the transition probabilities for the HKY85 model in the same way as for the F84 model. In short, we again start with a nucleotide A and envision two events e₁ and e₂. Event e₁ occurs with rate β, and results in the original A being replaced by any of the four nucleotides with probabilities equal to their respective equilibrium frequencies. Event e₂ occurs with rate γ and results in the original A being replaced by either A or G with probabilities equal to their respective equilibrium frequencies. Fictionalized in this way, the expected number of events during time t is [β(π_A+π_G+π_C+π_T) + γ(π_A+π_G)]t = (β+γπ_R)t. According to the Poisson distribution, the probability that no event has happened during time t is

$p (e_{1}, e_{2} = 0, t) = e^{- (β + γ π_{R}) t} (42)$

The probability that at least one e₁ occurred after time t is

$p (e_{1} > 0, t) = 1 - e^{- β t} (43)$

The probability that e₂ has occurred but e₁ has not is

$p (e_{2} > 0, e_{1} = 0, t) = 1 - p (e_{1}, e_{2} = 0, t) - p (e_{1} > 0, t) = e^{- β t} - e^{- (β + γ π_{R}) t} (44)$

The transition probability p(G|A,t), abbreviated as p_AG, is

$p_{A G} = π_{G} p (e_{1} > 0, t) + \frac{π_{G}}{π_{R}} p (e_{2} > 0, e_{1} = 0, t) = π_{G} + \frac{π_{G} π_{Y} e^{- β t}}{π_{R}} - \frac{π_{G} e^{- (β + π_{R} γ) t}}{π_{R}} (45)$

In the same way, we can derive other transition probabilities which are shown below:

$P_{H K Y} = \begin{matrix} A \\ G \\ C \\ T \end{matrix} [\begin{matrix} π_{A} + π_{A} x_{1} + π_{G} x_{2} & π_{G} + π_{G} x_{1} - π_{G} x_{2} & π_{C} (1 - e^{- β t}) & π_{T} (1 - e^{- β t}) \\ π_{A} + π_{A} x_{1} - π_{A} x_{2} & π_{G} + π_{G} x_{1} + π_{A} x_{2} & π_{C} (1 - e^{- β t}) & π_{T} (1 - e^{- β t}) \\ π_{A} (1 - e^{- β t}) & π_{G} (1 - e^{- β t}) & π_{C} + π_{C} x_{3} + π_{T} x_{4} & π_{T} + π_{T} x_{3} - π_{T} x_{4} \\ π_{A} (1 - e^{- β t}) & π_{G} (1 - e^{- β t}) & π_{C} + π_{C} x_{3} - π_{C} x_{4} & π_{T} + π_{T} x_{3} + π_{C} x_{4} \end{matrix}] (46)$

Where

$x_{1} = \frac{π_{Y} e^{- β t}}{π_{R}}; x_{2} = \frac{e^{- (β + π_{R} γ) t}}{π_{R}}; x_{3} = \frac{π_{R} e^{- β t}}{π_{Y}}; x_{4} = \frac{e^{- (β + π_{Y} γ) t}}{π_{Y}} (47)$

As a quick check of the transition probabilities, we first note that when t = 0 (or when β = 0 and γ = 0), then the diagonal elements are 1 and all off-diagonal elements are 0, which is what we expected. Second, when t approaches infinity with β > 0 and γ > 0, then the transition probabilities will approach the equilibrium frequencies, which is also what we expected.
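A further check that is specific to time-reversible models is detailed balance, π_i p_ij = π_j p_ji. The sketch below is illustrative only: the function name `hky85_matrix` is an assumption, and the parameter values are the βt and weighted γt from the Figure 2 example. It assembles the HKY85 matrix from the three event probabilities, with the no-event probability e^{-(β + γπ_R)t} for purines and e^{-(β + γπ_Y)t} for pyrimidines.

```python
import math

NUC = "AGCT"
PURINES = {"A", "G"}

def hky85_matrix(pi, beta_t, gamma_t):
    """Return P_HKY85 as a dict of dicts from the e1/e2 event probabilities."""
    pi_R = pi["A"] + pi["G"]
    pi_Y = pi["C"] + pi["T"]
    e_b = math.exp(-beta_t)
    P = {}
    for i in NUC:
        pi_class = pi_R if i in PURINES else pi_Y
        e_none = math.exp(-(beta_t + gamma_t * pi_class))   # neither event occurred
        P[i] = {}
        for j in NUC:
            p = pi[j] * (1.0 - e_b)                          # at least one e1
            if (i in PURINES) == (j in PURINES):
                p += pi[j] / pi_class * (e_b - e_none)       # e2 occurred, e1 did not
            if i == j:
                p += e_none
            P[i][j] = p
    return P

pi = {"A": 6/48, "G": 10/48, "C": 12/48, "T": 20/48}
P = hky85_matrix(pi, beta_t=0.2076393648, gamma_t=0.624180608)

# Largest deviation from a row sum of 1 and from detailed balance.
max_row_err = max(abs(sum(P[i].values()) - 1.0) for i in NUC)
max_balance_err = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) for i in NUC for j in NUC)
print(max_row_err, max_balance_err)
```

Both deviations should be at the level of floating-point rounding for any positive βt and γt.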

We cannot derive the distance for the HKY85 model by following the same approach as that for the F84 model. Hasegawa, et al. [4] tried this approach but were not successful because there is no explicit solution for βt and γt. However, if we treat the A↔G and C↔T transitions separately, then we can solve for βt and γt [10]. In other words, we obtain one set of βt and γt from observed A↔G transitions and transversions, and another set of βt and γt from observed C↔T transitions and transversions. βt in the two sets is the same as that in Eq. (38), but γt differs between the two sets of estimates. We can then take a weighted average of γt. Admittedly, this does sound mathematically clumsy and explains why HKY85, while commonly used in phylogenetic analysis involving a likelihood framework or Bayesian inference, is almost never used in distance-based phylogenetics.

Here is the somewhat circuitous protocol to get βt and γt from HKY85. The expected numbers of A↔G transitions, C↔T transitions, and transversions, designated E(S_R), E(S_Y), and E(V), respectively, are

$E (S_{R}) = 2 π_{A} p_{A G}$

$E (S_{Y}) = 2 π_{C} p_{C T}$

$E (V) = 2 π_{A} p_{A T} + 2 π_{A} p_{A C} + 2 π_{G} p_{G C} + 2 π_{G} p_{G T} (48)$

Setting E(S_R) and E(V) to the observed S_R and V, and solving for βt and γt, we have

$β t = - \ln (1 - \frac{V}{2 π_{R} π_{Y}})$

$γ_{R} t = \frac{1}{π_{R}} \ln (\frac{π_{A} π_{G} (2 π_{R} π_{Y} - V)}{2 π_{A} π_{G} π_{R} π_{Y} - S_{R} π_{Y} π_{R}^{2} - π_{A} π_{G} π_{Y} V}) (49)$

Where βt is the same as that in Eq. (38), and γ_Rt in Eq. (49) is γt estimated from observed S_R and V.

Now we obtain another set of solutions for βt and γt by setting E(S_Y) and E(V) to the observed S_Y and V and solving for βt and γt. This gives the same βt but a different γt:

$γ_{Y} t = \frac{1}{π_{Y}} \ln (\frac{π_{C} π_{T} (2 π_{R} π_{Y} - V)}{2 π_{C} π_{T} π_{R} π_{Y} - S_{Y} π_{R} π_{Y}^{2} - π_{C} π_{T} π_{R} V}) (50)$

A weighted average of γt could be

$γ t = π_{R} γ_{R} t + π_{Y} γ_{Y} t (51)$

The distance for the HKY85 model is

$D_{H K Y 85} = μ t = 2 π_{A} π_{G} (β t + γ t) + 2 π_{T} π_{C} (β t + γ t) + 2 π_{Y} π_{R} β t (52)$

To compute D_HKY85 using the two aligned sequences in Figure 2, we have π_A = 6/48, π_C = 12/48, π_G = 10/48, π_T = 20/48, S_Y = 4/24, S_R = 0, V = 2/24, βt = 0.2076393648, γ_Rt = -0.2223239164, γ_Yt = 1.047432870, weighted γt = 0.624180608, D_HKY85 = 0.308904. I intentionally chose the aligned sequences in Figure 2 with S_R = 0 just to see if D_HKY85 would behave strangely. It did not. For comparison, the same two sequences yield D_F84 = 0.319887.
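The protocol of Eqs. (49) to (52) can be sketched in Python as follows. The function name `hky85_distance` and its return tuple are illustrative assumptions; the inputs are the values quoted for Figure 2.

```python
import math

def hky85_distance(pi_A, pi_G, pi_C, pi_T, S_R, S_Y, V):
    """Return (beta_t, gammaR_t, gammaY_t, gamma_t, D_HKY85)."""
    pi_R, pi_Y = pi_A + pi_G, pi_C + pi_T
    a, c = pi_A * pi_G, pi_C * pi_T
    w = 2.0 * pi_R * pi_Y - V
    beta_t = -math.log(1.0 - V / (2.0 * pi_R * pi_Y))
    # Eq. (49): gamma*t estimated from A<->G transitions and transversions
    gammaR_t = math.log(a * w / (2*a*pi_R*pi_Y - S_R*pi_Y*pi_R**2 - a*pi_Y*V)) / pi_R
    # Eq. (50): gamma*t estimated from C<->T transitions and transversions
    gammaY_t = math.log(c * w / (2*c*pi_R*pi_Y - S_Y*pi_R*pi_Y**2 - c*pi_R*V)) / pi_Y
    # Eq. (51): weighted average of the two estimates
    gamma_t = pi_R * gammaR_t + pi_Y * gammaY_t
    # Eq. (52): distance as the average rate multiplied by t
    D = (2*a*(beta_t + gamma_t) + 2*c*(beta_t + gamma_t)
         + 2*pi_R*pi_Y*beta_t)
    return beta_t, gammaR_t, gammaY_t, gamma_t, D

beta_t, gammaR_t, gammaY_t, gamma_t, D_HKY85 = hky85_distance(
    6/48, 10/48, 12/48, 20/48, S_R=0.0, S_Y=4/24, V=2/24)
print(beta_t, gammaR_t, gammaY_t, gamma_t, D_HKY85)
```

With S_R = 0 the purine estimate of γt comes out negative, which is why the weighted average in Eq. (51) matters for this data set.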

In general, D_HKY85 is slightly smaller than D_F84. I used the eight vertebrate COI sequences in the FASTA file VertCOI.fas that comes with DAMBE [11] to compute both D_HKY85 and D_F84 (Figure 5). The two distances are nearly identical, with D_HKY85 consistently but only slightly smaller than D_F84.

Figure 5: Evolutionary distances from the HKY85 and F84 models are nearly identical.

The HKY85 model itself may not carry much biological significance given the existence of the F84 model. However, the twists involved in computing the evolutionary distance, i.e., the separate estimation of γA↔G and γC↔T, lead very naturally to a very useful TN93 model that we will cover next.

TN93 model

We have come far, so far that we need hardly any extra effort to derive transition probabilities for the TN93 model. There are two equivalent specifications of the rate matrix for the TN93 model. The first is

$Q_{T N 93} = \begin{matrix} A \\ G \\ C \\ T \end{matrix} [\begin{matrix} - & β π_{G} + γ_{R} π_{G} / π_{R} & β π_{C} & β π_{T} \\ β π_{A} + γ_{R} π_{A} / π_{R} & - & β π_{C} & β π_{T} \\ β π_{A} & β π_{G} & - & β π_{T} + γ_{Y} π_{T} / π_{Y} \\ β π_{A} & β π_{G} & β π_{C} + γ_{Y} π_{C} / π_{Y} & - \end{matrix}] (53)$

Where the diagonal elements are constrained by each row summing up to 0. The second specification simply replaces (β + γ_R/π_R) by α₁ and (β + γ_Y/π_Y) by α₂. We see that TN93 is reduced to F84 if γ_R = γ_Y, and to HKY85 if γ_R/π_R = γ_Y/π_Y.

The similarity between TN93 and F84 allows us to re-use Figure 4 for deriving transition probabilities for TN93. We only need to add a subscript R to γ and α in Figure 4 so that we have γ_R and α_R as rates for purine, keeping everything else the same, and we instantly obtain the transition probabilities for transitional substitutions between purines and for transversional substitutions as shown in Figure 4. To get transition probabilities between pyrimidines, we can just replace the original nucleotide A in Figure 4 by nucleotide C or T and rename γ and α in Figure 4 to γ_Y and α_Y. Note that our α_R = β+γ_R, and α_Y = β+γ_Y.

The transition probability matrix for the TN93 model is

$P_{T N 93} = \begin{matrix} A \\ G \\ C \\ T \end{matrix} [\begin{matrix} π_{A} + π_{A} π_{Y} x_{1} + π_{G} x_{2} & π_{G} + π_{G} π_{Y} x_{1} - π_{G} x_{2} & π_{C} (1 - e^{- β t}) & π_{T} (1 - e^{- β t}) \\ π_{A} + π_{A} π_{Y} x_{1} - π_{A} x_{2} & π_{G} + π_{G} π_{Y} x_{1} + π_{A} x_{2} & π_{C} (1 - e^{- β t}) & π_{T} (1 - e^{- β t}) \\ π_{A} (1 - e^{- β t}) & π_{G} (1 - e^{- β t}) & π_{C} + π_{C} π_{R} x_{3} + π_{T} x_{4} & π_{T} + π_{T} π_{R} x_{3} - π_{T} x_{4} \\ π_{A} (1 - e^{- β t}) & π_{G} (1 - e^{- β t}) & π_{C} + π_{C} π_{R} x_{3} - π_{C} x_{4} & π_{T} + π_{T} π_{R} x_{3} + π_{C} x_{4} \end{matrix}] (54)$

Where x₁ and x₃ are the same as those in Eq. (34), but x₂ has α replaced by α_R and x₄ has α replaced by α_Y, i.e.,

$x_{1} = \frac{e^{- β t}}{π_{R}}, x_{2} = \frac{e^{- α_{R} t}}{π_{R}}, x_{3} = \frac{e^{- β t}}{π_{Y}}, x_{4} = \frac{e^{- α_{Y} t}}{π_{Y}} (55)$
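To see the matrix and its special cases at work, here is a small Python sketch that types in the entries of the TN93 matrix explicitly, with the C-row diagonal written as π_C + π_C π_R x₃ + π_T x₄ (the form under which that row sums to 1). The function name `tn93_matrix` and the list-of-lists layout are illustrative assumptions, and the parameter values are the rounded estimates from the Figure 2 example.

```python
import math

def tn93_matrix(pi_A, pi_G, pi_C, pi_T, alphaR_t, alphaY_t, beta_t):
    """Return P_TN93 as a 4x4 list of rows in the order A, G, C, T."""
    pi_R, pi_Y = pi_A + pi_G, pi_C + pi_T
    x1 = math.exp(-beta_t) / pi_R
    x2 = math.exp(-alphaR_t) / pi_R
    x3 = math.exp(-beta_t) / pi_Y
    x4 = math.exp(-alphaY_t) / pi_Y
    tv = 1.0 - math.exp(-beta_t)      # common factor of all transversion entries
    return [
        [pi_A + pi_A*pi_Y*x1 + pi_G*x2, pi_G + pi_G*pi_Y*x1 - pi_G*x2, pi_C*tv, pi_T*tv],
        [pi_A + pi_A*pi_Y*x1 - pi_A*x2, pi_G + pi_G*pi_Y*x1 + pi_A*x2, pi_C*tv, pi_T*tv],
        [pi_A*tv, pi_G*tv, pi_C + pi_C*pi_R*x3 + pi_T*x4, pi_T + pi_T*pi_R*x3 - pi_T*x4],
        [pi_A*tv, pi_G*tv, pi_C + pi_C*pi_R*x3 - pi_C*x4, pi_T + pi_T*pi_R*x3 + pi_C*x4],
    ]

freqs = (6/48, 10/48, 12/48, 20/48)   # pi_A, pi_G, pi_C, pi_T
P_tn93 = tn93_matrix(*freqs, alphaR_t=0.13353, alphaY_t=0.90593, beta_t=0.20764)
row_sums = [sum(row) for row in P_tn93]
print(row_sums)

# Setting alphaR_t = alphaY_t collapses TN93 to F84 with that common alpha*t.
P_f84 = tn93_matrix(*freqs, alphaR_t=0.5778363341, alphaY_t=0.5778363341,
                    beta_t=0.2076393648)
print(P_f84[0][1])                    # p_AG under F84
```

With the Figure 2 estimates, the A to G entry of `P_tn93` is essentially zero, which mirrors the observed S_R = 0.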

To obtain the distance for the TN93 model (D_TN93), recall that a distance is defined as μt where μ is the average substitution rate, i.e., substitution rates in Eq. (53) weighted by the equilibrium frequencies, so:

$D_{T N 93} = 2 π_{A} π_{G} (β t + γ_{R} t / π_{R}) + 2 π_{T} π_{C} (β t + γ_{Y} t / π_{Y}) + 2 π_{Y} π_{R} β t (56)$

Now we need to obtain α_Rt, α_Yt, and βt. The method we will use is the same as that for the K80 and F84 models, i.e., we obtain the expected numbers of A↔G transitions, C↔T transitions, and transversions, designated E(S_R), E(S_Y) and E(V), respectively, from transition probabilities, and equate them to the observed S_R, S_Y and V to solve for α_Rt, α_Yt, and βt:

$E (S_{R}) = 2 π_{A} p_{A G} = S_{R}$

$E (S_{Y}) = 2 π_{C} p_{C T} = S_{Y}$

$E (V) = 2 π_{A} p_{A T} + 2 π_{A} p_{A C} + 2 π_{G} p_{G C} + 2 π_{G} p_{G T} = V (57)$

The resulting α_Rt, α_Yt, and βt are

$α_{R} t = \ln (\frac{2 π_{A} π_{G} π_{R} π_{Y}}{2 π_{A} π_{G} π_{R} π_{Y} - π_{R}^{2} π_{Y} S_{R} - π_{A} π_{G} π_{Y} V}) (58)$

$α_{Y} t = \ln (\frac{2 π_{C} π_{T} π_{R} π_{Y}}{2 π_{C} π_{T} π_{R} π_{Y} - π_{Y}^{2} π_{R} S_{Y} - π_{C} π_{T} π_{R} V}) (59)$

$β t = - \ln (1 - \frac{V}{2 π_{R} π_{Y}}) (60)$

If one wishes to express D_TN93 in S_R, S_Y and V, then one may just substitute γ_Rt, γ_Yt, and βt into Eq. (56), which yields:

$D_{T N 93} = \frac{2 π_{A} π_{G} [π_{Y} \ln (x_{1}) + \ln (x_{2})]}{π_{R}} + \frac{2 π_{C} π_{T} [π_{R} \ln (x_{1}) + \ln (x_{3})]}{π_{Y}} - 2 π_{R} π_{Y} \ln (x_{1}) (61)$

Where

$x_{1} = 1 - \frac{V}{2 π_{R} π_{Y}}$

$x_{2} = \frac{2 π_{A} π_{G} π_{R} π_{Y}}{2 π_{A} π_{G} π_{R} π_{Y} - S_{R} π_{R}^{2} π_{Y} - π_{A} π_{G} π_{Y} V}$

$x_{3} = \frac{2 π_{C} π_{T} π_{R} π_{Y}}{2 π_{C} π_{T} π_{R} π_{Y} - S_{Y} π_{Y}^{2} π_{R} - π_{C} π_{T} π_{R} V} (62)$

To illustrate the application of D_TN93 with the two aligned sequences in Figure 2, we have π_A = 6/48, π_C = 12/48, π_G = 10/48, π_T = 20/48, S_Y = 4/24, S_R = 0, V = 2/24, α_Rt = 0.13353, α_Yt = 0.90593, βt = 0.20764, γ_Rt = α_Rt – βt = -0.07411, γ_Yt = α_Yt – βt = 0.69829, D_TN93 = 0.35299. The variance of D_TN93 can be obtained by either the delta method or the method using the Fisher information matrix. Note that S_R = 0 means there is no information for estimating α_Rt properly.
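As a cross-check, the distance can be computed in two ways: from the rate-weighted definition in Eq. (56), and from a closed form in S_R, S_Y and V in which the last term is -2π_Rπ_Y ln(x₁), i.e., 2π_Rπ_Y βt. The Python sketch below does both; the function name `tn93_distance` and the returned tuple are illustrative assumptions, and the inputs are the Figure 2 values.

```python
import math

def tn93_distance(pi_A, pi_G, pi_C, pi_T, S_R, S_Y, V):
    """Return (alphaR_t, alphaY_t, beta_t, D_rates, D_closed) for TN93."""
    pi_R, pi_Y = pi_A + pi_G, pi_C + pi_T
    a, c = pi_A * pi_G, pi_C * pi_T
    # x1, x2, x3 as defined in Eq. (62)
    x1 = 1.0 - V / (2.0 * pi_R * pi_Y)
    x2 = 2*a*pi_R*pi_Y / (2*a*pi_R*pi_Y - S_R*pi_R**2*pi_Y - a*pi_Y*V)
    x3 = 2*c*pi_R*pi_Y / (2*c*pi_R*pi_Y - S_Y*pi_Y**2*pi_R - c*pi_R*V)
    beta_t = -math.log(x1)                 # Eq. (60)
    alphaR_t = math.log(x2)                # Eq. (58)
    alphaY_t = math.log(x3)                # Eq. (59)
    gammaR_t = alphaR_t - beta_t
    gammaY_t = alphaY_t - beta_t
    # Eq. (56): average substitution rate multiplied by t
    D_rates = (2*a*(beta_t + gammaR_t/pi_R) + 2*c*(beta_t + gammaY_t/pi_Y)
               + 2*pi_R*pi_Y*beta_t)
    # Closed form in x1, x2, x3
    D_closed = (2*a*(pi_Y*math.log(x1) + math.log(x2))/pi_R
                + 2*c*(pi_R*math.log(x1) + math.log(x3))/pi_Y
                - 2*pi_R*pi_Y*math.log(x1))
    return alphaR_t, alphaY_t, beta_t, D_rates, D_closed

alphaR_t, alphaY_t, beta_t, D_rates, D_closed = tn93_distance(
    6/48, 10/48, 12/48, 20/48, S_R=0.0, S_Y=4/24, V=2/24)
print(alphaR_t, alphaY_t, beta_t, D_rates, D_closed)
```

The two distance values should agree to floating-point precision, since the closed form is only an algebraic rearrangement of Eq. (56).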

I should mention that all distance formulations in this paper are known as Independently Estimated (IE) distances because they use information from only two aligned sequences and are independent of other pairs of sequences. Practical molecular phylogenetic analysis typically would use Simultaneously Estimated (SE) distances [12,13] which use information from all pairs of sequences. SE distances are implemented in MEGA [14] and DAMBE [11,15]. The PhyPA [16] function in DAMBE, which performs phylogenetic reconstruction based on pairwise alignment when reliable multiple sequence alignment is difficult to obtain for highly diverged sequences, uses SE distances only.

In short, the approach of deriving transition probabilities by probability reasoning can go a long way if one can do good bookkeeping. In particular, the probability reasoning approach is very useful for conceptual understanding. However, the approach becomes increasingly difficult with more complicated substitution models. Two alternative approaches, one involving solving differential equations and the other involving matrix exponential and logarithms, are often used in practical computation with the GTR model for nucleotide sequences and amino acid-based substitution models. They will be numerically illustrated elsewhere.

Acknowledgements

This study is funded by the Discovery Grant from Natural Science and Engineering Research Council of Canada (RGPIN/261252-2013). I thank C. Vlasschaert and S. Aris-Brosou for feedback.