**HKA**
-- A COMPUTER PROGRAM FOR TESTS OF NATURAL SELECTION --
DOCUMENTATION
Jody Hey
Department of Genetics
Rutgers University
Nelson Biological Labs
604 Allison Rd.
Piscataway, NJ 08854-8082
732-445-5272
fax 732-445-5870
http://lifesci.rutgers.edu/~heylab
* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.
______________________ Overview
_______________________
HKA is a computer program that carries out the widely used statistical test for natural selection that was developed by Hudson, R. R., M. Kreitman and M. Aguadé (1987, A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159). This program can handle very large numbers of loci and sample sizes, and conducts tests via coalescent simulation as well as by the conventional chi-square approximation. The simulations can also be used to conduct other tests of natural selection, including tests of Tajima's D statistic (1989) and the D statistic of Fu and Li (1993).
RATIONALE
The HKA test is based on the most basic prediction of the neutral model of molecular evolution: that DNA sequence polymorphism within a species, and DNA sequence divergence between species, will be proportional to the neutral mutation rate (Kimura, 1968, 1969). To this basic expectation we must add that polymorphism also depends on the effective population size and that divergence also depends on the time since speciation.
When we consider data collected from two species, and from multiple loci, we expect that all of the loci within a species will share the same effective population size, and we expect that each locus, regardless of species, will have a characteristic neutral mutation rate (depending on its length and level of selective constraint). Thus we can consider data on polymorphism and divergence cast in the form of a contingency table (Figure 1). We can also build a similar table of the expected values, each a function of the basic model parameters that are fit to the data. Those parameters are:
T - the time since the species divergence
f - the ratio of the two species' effective population sizes
Theta_i = 4N1 u_i - the population mutation rate for locus i in species 1, where N1 is the effective population size of species 1 and u_i is the neutral mutation rate at locus i
There is one Theta parameter for every locus.
Despite the presence of observations and corresponding expectations in parallel tables, one cannot conduct a conventional contingency table test (e.g. a chi-square test). Such a test relies upon a sampling variance within each cell that is roughly multinomial. However, the sampling variances within the cells of an HKA table are those of the coalescent process. Fortunately these can be calculated (Watterson, 1975).
Using these variances, Hudson et al. (1987) developed a chi-square-like test statistic that they called X^2. If the time since divergence has been sufficiently long, and polymorphism levels are not very low, then X^2 should be approximately chi-square distributed with 2*K-2 degrees of freedom, where K is the number of loci in the test.
Fig 1.
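As an illustration of the two quantities mentioned above, the following Python sketch computes the mean and variance of the number of segregating sites given by Watterson (1975) for a sample of n sequences, and the upper-tail probability of a chi-square distribution with 2*K-2 degrees of freedom. The function names are hypothetical and are not taken from the HKA program. The sketch does not compute the X^2 statistic itself, which also requires the fitted values of T, f and Theta_i. Because 2*K-2 is always even, the tail probability can be written in closed form without any external library.

```python
import math


def watterson_moments(theta, n):
    """Mean and variance of the number of segregating sites in a sample of
    n sequences under neutrality without recombination (Watterson, 1975)."""
    a_n = sum(1.0 / j for j in range(1, n))
    b_n = sum(1.0 / j ** 2 for j in range(1, n))
    mean = theta * a_n
    variance = theta * a_n + theta ** 2 * b_n
    return mean, variance


def hka_degrees_of_freedom(num_loci):
    """Degrees of freedom of the chi-square approximation, 2*K-2."""
    return 2 * num_loci - 2


def chi_square_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1, built up term by term.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

For example, chi_square_tail_even_df(x, hka_degrees_of_freedom(K)) gives the approximate probability of an X^2 value as large as x in a test with K loci, under the conditions stated above.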
ASSUMPTIONS OF THE HKA TEST
ADDITIONAL FEATURES OF THE HKA PROTOCOL
The basic HKA protocol has several attributes that permit a variety of applications in addition to an overall test of departure from the neutral model.
BASIC FEATURES OF THE HKA PROGRAM
The program can handle a very large number of loci (default compilation is for up to 20), with sample sizes per species per locus up to 350 (192 on the PowerPC).
The sample size for the second species can be 1 sequence, as in the original implementation of the test (Hudson et al., 1987).
The program will conduct rapid coalescent simulations using parameters estimated from the data.
The program will conduct a variety of tests based on Tajima's D statistic (Tajima, 1989) and Fu and Li's D statistic (Fu and Li, 1993).
_______________________ Downloadable Files
_______________________
_______________________ Input File Format
_______________________
Input Data File Format:
Thus the basic format is that of a table. With K loci the table has K rows. Each row begins with a locus name, which is followed by 13 numbers.
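The row layout described here can be read with a few lines of Python. This is only a sketch under stated assumptions: fields are taken to be separated by whitespace, locus names are taken to contain no spaces, and only the table rows themselves are handled. The function names are hypothetical, and the meaning of each of the 13 columns is not interpreted.

```python
def parse_hka_row(line, numbers_per_row=13):
    """Split one table row into (locus_name, list_of_numbers).

    Assumes whitespace-separated fields and a locus name without spaces.
    """
    fields = line.split()
    if len(fields) != numbers_per_row + 1:
        raise ValueError(
            "expected a locus name and %d numbers, got %d fields"
            % (numbers_per_row, len(fields))
        )
    name = fields[0]
    values = [float(field) for field in fields[1:]]
    return name, values


def parse_hka_table(text):
    """Parse every non-blank line of text as one locus row."""
    return [parse_hka_row(line) for line in text.splitlines() if line.strip()]
```

A table with K loci should therefore yield a list of K (name, values) pairs, each with 13 values.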
If one species has only a single sequence at any one of the loci, then the data for that species should be reduced to just one sequence at every locus. In addition, the species represented by just a single sequence at each locus should be species 2.
Inclusion of Tajima's and Fu and Li's statistics is optional. However, if they are included, the program will test their significance. It will also conduct tests of the overall mean of these statistics and of their variance among the loci.
If tests of D statistics are desired, but the values for some loci are not known or are not calculable, then the corresponding value in the input table should be set to -10 or less.
_______________________ Running the Program
_______________________
The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. The program can be run using command line parameters, or by simply typing the name of the program ('hka').
Command line parameters:
-D (or -I)  Input file name, e.g. -Dmydata
-R  Output file name, e.g. -RMyresults
-S  Number of simulations, e.g. -S1000 (the maximum is 10000)
-T  Do tests of Tajima's D statistic
-F  Do tests of Fu and Li's D statistic
-M  Comment line, e.g. -Mtest_my_data
To start the program with command line parameters, type, for example:
hka -Dmydata -Rmyresults -S1000 -Mtest_my_data -T -F
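When the program is driven from a script, the command line above can be assembled programmatically. The Python sketch below uses only the parameters listed in this section. The function name and its arguments are hypothetical, and the sketch only builds the argument list without running the program. Whether the comment given with -M may contain spaces is not stated here, so that is left unchecked.

```python
def build_hka_command(data_file, results_file, simulations=None,
                      comment=None, tajima=False, fu_li=False,
                      program="hka"):
    """Return the argument list for one run of the program."""
    args = [program, "-D" + data_file, "-R" + results_file]
    if simulations is not None:
        # The documented maximum number of simulations is 10000.
        if not 1 <= simulations <= 10000:
            raise ValueError("number of simulations must be 1..10000")
        args.append("-S%d" % simulations)
    if comment:
        args.append("-M" + comment)
    if tajima:
        args.append("-T")  # tests of Tajima's D statistic
    if fu_li:
        args.append("-F")  # tests of Fu and Li's D statistic
    return args
```

The resulting list can be joined with spaces to give the command shown above, or passed to a process-launching call such as Python's subprocess.run.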
To start the program without command line parameters, simply enter 'hka'. The program then asks for all of the necessary information.
On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered (e.g. -Dmydata -Rmyresults -S1000 -Mtest_my_data -T -F).
_______________________ Program Limitations
_______________________
The program can handle very large data sets. The distributed version will handle 20 loci and up to 350 individuals per species per locus. When sample sizes are modest it does simulations quickly.
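For readers unfamiliar with simulations of this kind, the following Python sketch draws the number of segregating sites for one locus under the standard neutral coalescent without recombination (see Hudson, 1990). It is not the code used by the HKA program, and the function name is hypothetical. It assumes a constant population size, an infinite-sites mutation model, and theta = 4Nu with theta greater than zero, with time measured in units of 2N generations.

```python
import random


def simulate_segregating_sites(theta, n, rng):
    """Draw the number of segregating sites in a sample of n sequences."""
    # Total branch length of the genealogy: while k lineages remain, the
    # waiting time to the next coalescence is exponential with rate
    # k*(k-1)/2, and it contributes k times its length to the tree.
    total_length = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0
        total_length += k * rng.expovariate(rate)
    # Mutations fall on the tree as a Poisson process with rate theta/2 per
    # unit length; count the arrivals that land within the total length.
    mutations = 0
    position = rng.expovariate(theta / 2.0)
    while position < total_length:
        mutations += 1
        position += rng.expovariate(theta / 2.0)
    return mutations
```

Repeating the draw many times, for example with rng = random.Random(1), gives a simulated distribution whose mean should be close to theta times the sum of 1/j for j from 1 to n-1.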
The most glaring limitation is that the program does not incorporate recombination into the simulations. Recombination reduces the variance of polymorphism levels, and for some purposes it would be nice to include this.
_______________________ Literature Cited
_______________________
Fu, Y. X., and W. H. Li, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693-709.
Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D. Futuyma and J. Antonovics. Oxford University Press, New York.
Hudson, R. R., M. Kreitman and M. Aguadé, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159.
Kimura, M., 1968 Evolutionary rate at the molecular level. Nature 217: 624-626.
Kimura, M., 1969 The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893-903.
Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.
Wang, R. L., and J. Hey, 1996 The speciation history of Drosophila pseudoobscura and close relatives: inferences from DNA sequence variation at the period locus. Genetics 144: 1113-1126.
Watterson, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-275.
This page was last changed February 04, 2004