**SITES** -- DNA POLYMORPHISM ANALYSIS PROGRAM --

**HKA** -- A COMPUTER PROGRAM FOR TESTS OF NATURAL SELECTION --

DOCUMENTATION

Jody Hey

Department of Genetics

Rutgers University

Nelson Biological Labs

604 Allison Rd.

Piscataway, NJ� 08854-8082

732-445-5272

fax 732-445-5870

hey@biology.rutgers.edu

https://ccgg.temple.edu/heylab/home

�* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.

_______________________

�Contents

_______________________

Overview
Downloadable Files
Input File Format
Running the Program
Output
Program Limitations
Literature Cited

______________________

�Overview ��

_______________________

HKA is a computer program that carries out the widely used statistical test for natural selection that was developed by Hudson, R. R., M. Kreitman and M. Aguad� (1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159). This program can handle very large numbers of loci and sample sizes, and conducts tests via coalescent simulation as well as by the conventional chi square approximation. The simulations can also be used to conduct other tests of natural selection, including tests of Tajima's D statistic (1989) and the D statistic of Fu and Li (1993).

RATIONAL

The HKA test is based on the most basic prediction of the neutral model of molecular evolution, thatDNA sequence polymorphism within a species, and DNA sequence divergence between species, will be proportional to the neutral mutation rate (Kimura, 1968, 1969). To this basic expectations we must add that polymorphism also depends on the effective population size and that divergence also depends on the time since speciation. When we consider data collected from two species, and from multiple loci, we expect that all of the loci within a species will share the same effective population size, and we expect that each locus, regardless of species, will have a characteristic neutral mutation rate (depending on its length and level of selective constraint). Thus we can consider data on polymorphism and divergence cast in form of a contingency table (Figure 1). We can also build a similar table of the expected values, each a function of the basic model parameters that are fit to the data. Those parameters are:

T - the time since the species divergence

f - the ratio of the the two species effective population sizes.

Theta_i = 4N1 u_i - the population mutation rate for locus i in species 1, where N1 - the effective population size of species 1 and u_i is the neutral mutation rate at locus i.

There is one Theta parameter for every locus.

Despite the presence of observations and corresponding expectations in parallel tables, one cannot conduct a conventional contingency table test (e.g. a chi-square test). Such a test relies upon a sampling variance within each cell that is roughly multinomial. However, the sampling variances within the cells of an HKA table are those of the coalescent process. Fortunately these can be calculated (Watterson, 1975). Using these variances, Hudson et. al., (1987) developed a chi-square-like test statistic, that they called X^2. If divergence has been sufficiently long, and polymorphism levels are not very low, then X^2 should be approximately chi-square distributed with 2*K-2 degrees of freedom where K is the number of loci in the test.

Fig 1.

	Loci
Species	1	2	3	.	.	.
1	Polymorphism Species 1 Locus 1	Polymorphism Species 1 Locus 2	Polymorphism Species 1 Locus 3	.	.	.
2	Polymorphism Species 2 Locus 1	Polymorphism Species 2 Locus 2	Polymorphism Species 2 Locus 3	.	.	.
1 vs 2	Divergence Locus 1	Divergence Locus 2	Divergence Locus 3	.	.	.

ASSUMPTIONS OF THE HKA TEST

Constant population sizes since the time of common ancestry of the two species.
The size of the common ancestral species is either the average of that of the two descendant species or (if only one species is represented by multiple sequences) then the ancestor is assumed to be the same size as that descendant species for which multiple sequences have been sampled. This assumption is not critical if T is so large that there is little chance that any of the variation within species dates from the time of common ancestry.
No recombination within loci. This is a conservative assumption, as the presence of recombination will reduce the variance of the observations under the model.
Free recombination between loci. Again this is a conservative assumption. If the loci are not evolving independently then they assumption of stochastic variance among loci does not hold.
All mutations are selectively neutral.
The variance of departures of observations of polymorphisms and divergence, from expectations under the model, follow normal distributions. This assumption is required if the X^2 test statistic is to be well approximated by a chi square distribution. The assumption will hold approximately if Theta, T and sample sizes are large. The program conducts a test by simulation, and compares the observed distribution of the X^2 statistic with the simulated distribution. Thus failure of the assumption of normality of deviations will not lead to biased estimates of the distribution of the test statistic. However failure of the assumption of normality can be expected to reduce the power of the test.

ADDITIONAL FEATURES OF THE HKA PROTOCOL

The basic HKA protocol has several attributes that permit a variety of applications in addition to an overall test of departure from the neutral model.

Calculation of X^2 test statistic requires calculation, for each cell of the matrix, of departure between observation and expectation under the model. When just 2 loci are examined, these cells are highly non-independent and so these individual departure statistics are not very informative. But when many loci are included in the test, there arises scope for inferring which loci, and in what ways, the departure from neutrality occurs.
The HKA procedure generates estimates for each of several basic neutral model parameters that are necessary for calculating the expected values for the data set. Furthermore, since these parameters are estimated jointly from as many loci as are included in the test, they can be expected to have lower variance than comparable estimates that are based on individual loci. These parameter estimates can then also be used for other purposes.
The parameter estimates can be used to conduction simulations in which the null model assumptions of the test are taken to be correct. These are coalescent simulations, and it is a straightforward matter to have them follow the null model HKA assumptions exactly (Hudson, 1990).
Because the model underlying the HKA test is readily simulated, it is not difficult to generate new data sets from the same parameters as fit the actual data, in order to determine the actual distribution of X^2 under the model. In other words, if conditions are such that the X^2 may not follow a chi-square distribution, then there is no need to assume that it does. One can simply conduct the simulations and use the simulated distribution of X^2 for test purposes.
The simulated data sets, that are based on parameters estimated from the actual data, can also be used to examine other features of the data. For example, the computer program will conduct tests of Tajima's and Fu and Li's D statistics (Tajima, 1989; Fu and Li, 1993). These tests begin with a calculation of the difference between two distinct estimates (i.e. using different estimators) of the parameter Theta, and proceed to consider whether the observed value is greater than expected by chance. One difficulty with the methods is that the test statistics are a bit sensitive to the true value of Theta, which is generally not known (Hudson, 1993). The HKA protocol generates estimates of Theta, and does so using considerably more information (and more assumptions) than any of the conventional estimators used by the D statistics.

BASIC FEATURES OF THE HKA PROGRAM

The program can handle a very large number of loci (default compilation is for up to 20), with sample sizes per species per locus up to 350 (192 on the PowerPC).

The sample size for the second species can be 1 sequence, as in the original implementation of the test (Hudson et al., 1987)

The program will conduct rapid coalescent simulations using parameters estimated from the data.

The program will conduct a variety of tests based on Tajima's D statistic (Tajima, 1989) and Fu and Li's D statistic (Fu and Li, 1993)

______________________

Downloadable Files�� Return to Contents

______________________

_______________________

�Input File Format�� Return to Contents

_______________________

Input Data File Format:

Line 1 - a line of text. If desired, up to 10 extra lines of commentary can be added following line 1. Each additional line of commentary must have a '#' at the very beginning of the line. All comment lines will be reproduced in the output file.
line 2 - the number of loci = numloci
line 3 - the names of the two species, species 1 and species 2
the next numloci lines each have the following items
- locus identifier
- inheritance scalar (1.0 for autosomal 0.75 for X linked 0.25 for extrachromosomal or Y)
- mean sequence length for species 1
- mean sequence length species 2
- mean sequence length in comparisons between species
- # of sequences for species 1
- # of sequences for species 2
- # of polymorphic sites species1
- # of polymorphic sites species2
- mean pairwise divergence between species
- Tajima D for species 1 (Tajima, 1989)
- Tajima D for species 2 (Tajima, 1989)
- Fu & Li D (rooted) for species 1 (Fu and Li, 1993)
- Fu & Li D (rooted) for species 2 (Fu and Li, 1993)

Thus the basic format is of a table. With K loci the table would have K rows. Each row begins with a locus name and is followed by 13 numbers.

If one species has only a single sequence at one, at one or any of the loci, then that data set should be reduced to just one sequence for all of the loci for that species. In addition the species represented by just a single sequence for each locus should be species 2.

Inclusion of Tajima's and Fu&Li's statistics is optional and they need not be included. However if they are the program will test their significance. It will also conduct tests of the overall mean among these statistics and of their variances among the loci.

If tests of D statistics are desired, but the values for some loci are not known or are not calculable, then the corresponding value in the input table should be set to -10 or less.

_______________________

�Running the Program�� Return to Contents

_______________________

The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. The program can be run using command line parameters, or by simply typing the name of the program ('hka').

Command line parameters:

-D (or -I) Input file name e.g. -Dmydata

-R Output file name e.g. -RMyresults

-S Number of simulations e.g. -S1000 (the maximum is 10000)

-T Do tests of Tajima's D statistic

-F Do tests of Fu and Li's D statistic

-M comment line e.g. -Mtest_my_data

To start the program with command line parameters:

For example, type:

hka -Dmydata -Rmyresults -S1000 -Mtest_my_data -T -F

To start the program without command line parameters, simply enter 'hka'. The program then asks for all of the necessary information

On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered (e.g. -Dmydata -Rmyresults -S1000 -Mtest_my_data -T -F ). The user can also just hit return at this point and the program will request runtime parameters.

_______________________

Output�� Return to Contents

_______________________

The output is all included in the Output file, and includes the following:

Basic information about input data, including comments from input file, and numbers and sizes of loci.
Tables of Observed and Expected Values. There are two tables, one for polymorphism within species and one for divergence between species. These tables include the observations, as well as the expected values after fitting the HKA model, the variance under the model, and the deviation for each observation (i.e. (observed - expected)^2/variance) )/
The X^2 statistic (i.e. the sum of deviations) is compared to the chi square distribution with appropriate number of degrees of freedom. For two species and K loci there are 2*K-2 degrees of freedom. When one species is represented by just a single sequence at each locus, there are L-1 degrees of freedom.
If simulations were done:
- The observed value of X^2 is also placed within the simulated distribution of X^2.
- The cumulative distribution of the X^2 statistic is given together with the corresponding chi square distribution.
- a TEST OF MAXIMUM CELL VALUE is conducted (Wang and Hey, 1996). The result is very simply the location of the largest observed deviation value in the distribution of largest deviation values, regardless of species, or locus and regardless of whether the observation corresponds to polymorphism or divergence. If the distributions of all cells of the data matrices are equally distributed, as would be the case under the assumptions of the method, then the properties of this test could be predicted. If that does not hold, then the test will be based primarily on those portions of the simulated data sets that tend to have positively skewed deviations from expectations.
- ESTIMATED PARAMETER VALUES AND SIMULATION STATISTICS. This is a table of parameter estimates, and, if simulations are done, of 95% confidence intervals and of simulation means. The simulation means should be close to the parameter estimates.
- TESTS OF MEANS AND VARIANCES OF TAJIMA D VALUES. If the input data contained Tajima D values, then the mean values, among loci, are compared with the distribution of simulated mean values.
- TESTS OF INDIVIDUAL TAJIMA D VALUES. These are tests of observed values of Tajima's D for each locus against the simulated distributions.
- The same types of tests are done using Fu and Li's D statistic.

_______________________

�Program Limitations�� Return to Contents

_______________________

The program can handle very large data sets. The distributed version will handle 20 loci and up to 350 individuals per species per locus. When sample sizes are modest it does simulations quickly.

The most glaring limitation is that the program does not incorporate recombination into the simulations. Recombination reduces the variance of polymorphism levels, and for some purposes it would be nice to include this.

_______________________

�Literature Cited�� Return to Contents

_______________________

Fu, Y. X., and W. H. Li, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693-709.

Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D.

Futuyma and J. Antonovics. Oxford University Press, New York.

Hudson, R. R., M. Kreitman and M. Aguad�, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159.

Kimura, M., 1968 Evolutionary rate at the molecular level. Nature 217: 624-626.

Kimura, M., 1969 The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893-903.

Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.

Wang, R. L., and J. Hey, 1996 The speciation history of Drosophila pseudoobscura and close relatives: inferences from DNA sequence variation at the period locus. Genetics 144: 1113-26.

Watterson, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-275.

This page was last changed September 04, 2013