**SITES** -- DNA POLYMORPHISM ANALYSIS PROGRAM --

**WH** -- A COMPUTER PROGRAM FOR ISOLATION MODEL FITTING--*

DOCUMENTATION�

Jody Hey

Department of Genetics

Rutgers University

Nelson Biological Labs

604 Allison Rd.

Piscataway, NJ� 08854-8082

732-445-5272

fax 732-445-5870

hey@biology.rutgers.edu

https://ccgg.temple.edu/heylab/home

*Some key internal parts of this program derive from the first program for the method that was written by John Wakeley.

This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.

_______________________

�Contents

_______________________

Overview
Downloadable Files
Input File Format
Running the Program
Output
Program Limitations
Literature Cited

______________________

�Overview ��

______________________

WH is a program that fits a simple speciation model, called the Isolation Model, to multilocus DNA sequence data sets. The isolation model assumes the following:

Two species, or populations, from which the data have been sampled, arose from a single ancestral species, or population, some time, t generations, in the past.
The common ancestral species, or population was of constant effective size NA.
The two descendant species (populations) have contant effective sizes N1 and N2, respectively.
Except for the moment of population separation at time t, population sizes are constant.
Since population separation, there is zero gene flow between the species (populations).
All mutations are neutral.

WH implements the methods described in

Wakeley, J., and J. Hey, 1997 Estimating ancestral population parameters. Genetics 145: 847-855.

Wang, R. L., J. Wakeley and J. Hey, 1997 Gene flow and natural selection in the origin of Drosophila pseudoobscura and close relatives. Genetics 147: 1091-1106.

______________________

Downloadable Files�� Return to Contents

______________________

_______________________

�Input File Format�� Return to Contents

_______________________

Line 1. A line of text that generally provides a brief description of the data set. If desired, up to 10 extra lines of commentary can be added following line 1. Each additional line of commentary must have a '#' at the very beginning of the line. All comment lines will be reproduced in the output file.
Next Line � includes 1 required number and optionally 2 additional values
First Required Value - an integer corresponding to the number of loci.
First Optional Value - The mutation rate per year since divergence of the species, totaled across all of the genes
Second Optional - The number of generations per year for the organisms in the species
- NOTE - the mutation rate per year and the number of generations are not required for the primary analysis. If you desire estimated parameter values in terms of actual population sizes, and years, then estimates of these values should be provided.
Next line has two species names for species 1 and species 2. Each name can be up to 10 characters in length.
There follows one line for each locus. Each line has up to 17 items depending on the tests that are to be done.

- If all that is to be done is a basic fitting of the isolation model, and if no tests are to be done, then each line is required to have 9 items - see DATA LINES below.

- If simulations and statistical tests are to be done, then each line also requires 2 additional items � the population recombination rate estimates for each species

- If linkage disequilibrium tests are also to be done, then there will be an additional 6 items, for a total of 17.

DATA LINES

FOR EACH LOCUS, ONE LINE PER LOCUS, IN ORDER:

locus name (a string of no more than 10 characters)
inheritance scalar (1.0 autosomal; 0.75 X �linked; 0.25 mitochondrial or Y-linked)
length in base pairs
# sequences for species 1
# sequences for species 2 (total for both species per locus not over 32 (64 when compiled under Microsoft Visual C++)
Sx1 - polymorphisms exclusive to species 1
Sx2 - polymorphisms exclusive to species 2
Ss - shared polymorphisms
Sf - fixed differences

If Simulations are to be done then then each line should also have:

4Nc (population recombination rate) in species 1 (use ��1� if not known)
4Nc (population recombination rate) in species 2 (use ��1� if not known)

If tests of Linkage disequilibrium are to be done then each line should also have:

LD among �shared� polymorphisms for species 1
LD among �shared� polymorphisms for species 2
LD among �shared� polymorphisms minus LD among and non-shared polymorphisms for species 1.
LD among �shared� polymorphisms minus LD among non-shared polymorphisms for species 2.
LD among �shared� polymorphisms minus LD between �shared� and non-shared polymorphisms for species 1.
LD among �shared� polymorphisms minus LD between �shared� and non-shared polymorphisms for species 2.

Note if there is not an LD measurement for a locus, -10 is used.

Note on simulations.

The simulation results are sensitive to the amount of recombination. In the published descriptions of these simulations (Wang, Wakeley and Hey, 1997; Kliman et al., 2000) we used the gamma estimator of recombination (Hey and Wakeley, 1997). This estimator tends to have a bias such that the estimates are lower than the expected value of the parameter. The result of having lower recombination is to raise the variance of the observations (of exclusive, shared and fixed variants) and thus to broaden the distribution of test statitics of the fit of the model to the data. In this sense, the tests should be conservative. However this is not guaranteed, and users may want to exam the quality of the fit between the model and their data by considering a range of recombination rates.

Recombination rate estimates are sometime not available for both species. Also they are never available for the common ancestral species. Following is the method of assignment of population recombination rates:

- The program takes 4Nc1i as input (4Nc for species 1, locus i) and then sets

4Nc2i = 4Nc1i theta2/theta1

4NcAi = 4Nc1i thetaA/theta1

-Obtaining 4Nc1i depends on whether one has estimates for species 1 or species 2 or both. If only 4Nc1i is available, then that's it. - If only 4Nc2i is available, then 4Nc1i = 4Nc2i theta1/theta2. If both are available, then 4Nc1i = (4Nc1i + 4Nc2i theta1/theta2)/2.

_______________________

�Running the Program�� Return to Contents

_______________________

The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. The program can be run using command line parameters, or by simply typing the name of the program ('wh'). If command line parameters are not used, the program asks for the values of runtime parameters.

The user starts the program simply by going to the folder where the data file exists and typing the name of the program (e.g. 'sites') followed by the� enter key. The program asks several questions about the data file and the� desired analysis. Nearly all commands and options can also be entered using command line parameters.

The program can be started with or without the use of instructions at the command line.

Without command line instructions - simply type �wh� at the prompt.

The program will ask for basic information.

On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered. The user can also just hit return at this point and the program will request runtime parameters.

Command Line Parameters:

Type and enter

wh -d'datafilename' -r'resultsfilename' -N'numsims' -L'ldtype' -A'ranseed'

Where:

'datafilename' is the name of a data file that is in the same folder as the program
'resultsfilename' is the name of the file that will be created to contain the results. The results file will be generated with an extension of '.wh'
'numsims' is an integer of the the number of simulations to be run. Basic model fitting does not require simulations. If this value is not zero or blank, the data file must have the necessary information on recombination rates.
'ldtype' tells the program what type of linkage disequilibrium value to calculate. This option requires that simulations be done, and that the data file include information on recombination and observed LD values. 'ldtype' must correspond to the measure of linkage disequilibrium used to generate the datafile. 'ldtype' can take on the following values:
- 'R' invokes the correlation coefficient
- 'S' invokes r^2, the square of the correlation coefficient
- 'D' invokes the basic gametic disequilibrium measure D
- 'P' invokes the D' which is equal to D/ Dmax, where Dmax is the maximum possible value of D given the observed allele frequencies.
- 'B' invokes the absolute value of D, |D|
'ranseed' is a random number seed and is not generally required unless the user wants to repeat simulations exactly.

_______________________

Output�� Return to Contents

_______________________

Output is contained in the results file. There are three main sections: INPUT; MODEL FITTING RESULTS; and SIMULATION RESULTS.

INPUT simply lists in tabular form the data in the input file.

MODEL FITTING RESULTS lists the following:

Basic Parameter Estimates - these are the overall model parameter estimates as described in Wang, Wakeley and Hey (1997)
Expected Polymorphism Levels - this table simply lists the observed numbers of exclusive, shared and fixed polymorphisms together with the expected numbers given the parameter estimates.
Other Calculations - including alternative time parameterizations, and estimates of effective population size if the optional parameters were provided at run time.

SIMULATION RESULTS lists the following:

The results of tests of the quality of fit of the isolation model to the data. Two test statistics are used, the chisquare (as described in Kliman et al., 2000) and the wh statistic which is the original statistics described by Wang Wakeley and Hey (1997).
A table of parameter value estimates, together with simulated 95% confidence intervals and means. The means are useful to assess the bias of the parameter estimates.
A table of observed, and simulated mean levels of exclusive, shared and fixed variants. This table should resemble the table that compares observed and expected polymorphism levels.
Linkage Disequilibrium Analysis - several tables list comparisons between observed levels of disequilibrium among shared polymorphisms, among exclusive polymorphisms, and between shared and exclusive polymorphisms.

______________________

�Program Limitations�� Return to Contents

_______________________

For simulations, the program can only handle a total sample size for each locus of 32. If the program is compiled under Microsoft Visual C++ (as the distributed Win32 version is) then it can makes use of a compiler extension and can handle total per locus sample sizes of 64.

For basic model fitting, without simulations, larger sample sizes can be used.

During simulations, recombination within a locus can occur only between sequence segments. The program has been compiled with 50 segments per sequence, which should be more than sufficient for most data sets. However it is possible that this will not be sufficient for loci with long sequences and high amounts of recombination.

_______________________

�Literature Cited�� Return to Contents

_______________________

Hey, J., and J. Wakeley, 1997 A coalescent estimator of the population recombination rate. Genetics 145: 833-846.

Kliman, R. M., P. Andolfatto, J. A. Coyne, F. Depaulis, M. Kreitman et al., 2000 The population genetics of the origin and divergence of the Drosophila simulans complex species. Genetics 156: 1913-31.

Wakeley, J. and J. Hey. 1997 Estimating ancestral population parameters.Genetics 145, 847-855.

Wang, R. L., J. Wakeley and J. Hey, 1997 Gene flow and natural selection in the origin of Drosophila pseudoobscura and close relatives. Genetics 147: 1091-106.

This page was last changed September 04, 2013