Commit f268aef

apragsdale authored and petrelharp committed
More carefully define two-locus haplotypes in docs
1 parent c1f8d80 commit f268aef

1 file changed

Lines changed: 35 additions & 7 deletions

docs/stats.md

@@ -988,18 +988,46 @@ sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
 ##### One-way

 The two-locus summary functions all take haplotype counts and sample set size
-as input. Each of our summary functions has the signature
+as input. Suppose that at the first site there are alleles
+{math}`(a_1, a_2, ...)`, and at the second site there are alleles
+{math}`(b_1, b_2, ...)`. For a pair of focal alleles {math}`a_i` and
+{math}`b_j`, we define two-locus counts
+{math}`(n(a_i,b_j), n(a_i,\sim b_j), n(\sim a_i, b_j))`, where
+{math}`n(a_i,b_j)` is the number of two-locus haplotypes in the sample set that
+carry both alleles {math}`a_i` and {math}`b_j`,
+{math}`n(a_i,\sim b_j)` is the number that carry the allele {math}`a_i`
+and do not carry the allele {math}`b_j`, and
+{math}`n(\sim a_i, b_j)` is the number that carry the allele {math}`b_j`
+and do not carry the allele {math}`a_i`. That is,
+{math}`n(\sim a_i, b_j) = \sum_{k\not=i} n(a_k, b_j)`, and
+{math}`n(a_i, \sim b_j) = \sum_{l\not=j} n(a_i, b_l)`.
+
+We informally refer to focal alleles as {math}`A,B` and the above sets of
+haplotypes as {math}`(AB, Ab, aB)`, so that {math}`Ab` refers to the set
+of all haplotypes {math}`(a_i, \sim b_j)` and {math}`aB` refers to
+{math}`(\sim a_i, b_j)`.
+Their counts are labeled similarly: {math}`n_{AB} = n(A,B)`,
+{math}`n_{Ab} = n(A, \sim B)`, and {math}`n_{aB} = n(\sim A, B)`.
+Then each of our summary functions has the signature
 {math}`f(n_{AB}, n_{Ab}, n_{aB}, n)`, converting to haplotype frequencies
-{math}`\{p_{AB}, p_{Ab}, p_{aB}\}` by dividing by {math}`n`. Below,
+{math}`\{p_{AB}, p_{Ab}, p_{aB}\}` by dividing by the number {math}`n` of
+samples in the sample set. Then
 {math}`n_{ab} = n - n_{AB} - n_{Ab} - n_{aB}`, {math}`n_A = n_{AB} + n_{Ab}`
 and {math}`n_B = n_{AB} + n_{aB}`, with frequencies {math}`p` found by dividing
 by {math}`n`.

-Our convention is to use {math}`A,B` to denote derived alleles, and {math}`a,b`
-ancestral alleles (or other alleles, if the site is multi-allelic). For
-polarised statistics, we average statistics over all non-ancestral alleles. For
-unpolarised statistics, the labeling is arbitrary as we average over all
-alleles (derived and ancestral).
+For polarised statistics, we compute the statistic using all pairs of
+non-ancestral alleles as focal alleles: so, we do not compute the summary
+function with haplotype counts for which the focal alleles are the ancestral
+allele at either of the two loci.
+For unpolarised statistics, we compute the summary function over all
+pairs of alleles. Thus, for polarised statistics, the summary function is
+called {math}`(n_1-1)\times(n_2-1)` times, where {math}`n_1` and {math}`n_2`
+are the total number of alleles at the first and second locus, respectively.
+For unpolarised statistics, the summary function is called {math}`n_1 n_2`
+times. The result is then averaged over the results computed for
+each pair of focal alleles, using the specified weighting approach for a
+given summary function.

 `D`
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB} \, (=p_{AB} - p_A p_B)`
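
The definitions added in this hunk can be checked on a toy sample. The following is a minimal sketch in plain Python, not the library's implementation: the helper names `two_locus_counts`, `d_stat`, and `focal_pairs` are hypothetical, haplotypes are assumed to be given as `(allele_at_site_1, allele_at_site_2)` tuples, and allele `0` is assumed to be ancestral at both sites. The weighting used to average over focal pairs is not specified in the text above, so it is left out.

```python
from collections import Counter
from itertools import product


def two_locus_counts(haplotypes, a, b):
    """Return (n_AB, n_Ab, n_aB, n) for focal alleles a (site 1) and b (site 2).

    Hypothetical helper: `haplotypes` is a list of (allele_1, allele_2) tuples.
    """
    counts = Counter(haplotypes)
    n = len(haplotypes)
    # n(a, b): haplotypes carrying both focal alleles
    n_AB = counts[(a, b)]
    # n(a, ~b): sum over all alleles other than b at the second site
    n_Ab = sum(c for (x, y), c in counts.items() if x == a and y != b)
    # n(~a, b): sum over all alleles other than a at the first site
    n_aB = sum(c for (x, y), c in counts.items() if x != a and y == b)
    return n_AB, n_Ab, n_aB, n


def d_stat(n_AB, n_Ab, n_aB, n):
    """Summary function D = p_AB * p_ab - p_Ab * p_aB."""
    p_AB, p_Ab, p_aB = n_AB / n, n_Ab / n, n_aB / n
    p_ab = 1 - p_AB - p_Ab - p_aB
    return p_AB * p_ab - p_Ab * p_aB


def focal_pairs(alleles_1, alleles_2, ancestral=None):
    """List the focal allele pairs the summary function is evaluated on.

    ancestral=(anc_1, anc_2) gives the polarised case (ancestral alleles
    are excluded at both sites); ancestral=None gives the unpolarised case.
    """
    if ancestral is not None:
        alleles_1 = [x for x in alleles_1 if x != ancestral[0]]
        alleles_2 = [y for y in alleles_2 if y != ancestral[1]]
    return list(product(alleles_1, alleles_2))


# Toy sample set: 8 haplotypes, 3 alleles at site 1, 2 alleles at site 2.
haps = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (2, 1), (2, 0)]

counts = two_locus_counts(haps, 1, 1)  # (2, 1, 2, 8)
d = d_stat(*counts)                    # 0.0625

polarised = focal_pairs([0, 1, 2], [0, 1], ancestral=(0, 0))  # 2 pairs
unpolarised = focal_pairs([0, 1, 2], [0, 1])                  # 6 pairs
```

With three alleles at the first site and two at the second, the polarised case yields {math}`(3-1)\times(2-1) = 2` focal pairs and the unpolarised case {math}`3 \times 2 = 6`, matching the call counts stated in the added text. The value of `d` also agrees with the alternative form {math}`p_{AB} - p_A p_B = 2/8 - (3/8)(4/8)`.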
