**SITES**
-- DNA POLYMORPHISM ANALYSIS PROGRAM --
DOCUMENTATION
Jody Hey
Department of Genetics
Rutgers University
Nelson Biological Labs
604 Allison Rd.
Piscataway, NJ 08854-8082
732-445-5272
fax 732-445-5870
https://ccgg.temple.edu/heylab/home
* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.
_______________________
SITES is a computer program for the analysis of comparative DNA sequence data. Basic analyses include: data summaries by polymorphism class; polymorphism estimates within and between groups (species); estimates of migration, neutral model, and recombination parameters; and linkage disequilibrium analyses. SITES is primarily intended for data sets with multiple closely related sequences. It is especially useful when multiple sequences have been obtained from each of one or several closely related populations or species.
SITES can handle large data sets, and is flexible with regard to the input data. With a few commands on the command line, or in response to menus, one can tailor a particular data set in different ways to suit different questions, including working on only a subset of the data and regrouping the data.
If you find the need to mention the program in a publication, you may cite the following reference, which mentions the program:
Hey, J. and J. Wakeley. 1997. A coalescent estimator of the population recombination rate. Genetics 145: 833-846.
______________________
Downloadable Files
______________________
WIN32 package (including documentation, sample data file, and WIN32 program)
PowerPC package (including documentation, sample data file, and PowerPC program)
_______________________
New Features
_______________________
The major recent addition, as of 2001, is a suite of linkage disequilibrium analyses.
SITES will also generate lines suitable for input to the HKA and WH computer programs.
SITES will also now handle alternative genetic codes.
Note to previous users: some command line flags have changed. Check the menus described below.
_______________________
Input File Format
_______________________
There are three kinds of input format: SITES format, PHYLIP sequential format, and PHYLIP interleaved format. PHYLIP is Joe Felsenstein's package of phylogenetic programs. The first line of a PHYLIP format file begins with two integers. SITES checks whether the first character of a file is a digit (i.e., 0..9) and, if so, assumes that the file is in PHYLIP format. If the first character is not a digit, it assumes SITES format. SITES format is very similar to PHYLIP sequential format. If multiple analyses are to be run, it is most efficient to set up a SITES format file.
Data Restrictions.
The only characters that are allowed in the sequences are 'A','a','G','g','C','c','T','t','N','n','-','.', and '*'. 'N' and 'n' represent base positions where the sequence is not known. '-', '*', and '.' represent base positions where one sequence has a gap relative to another sequence. Other characters cause the program to insert an 'N' in their place, and to display a warning at runtime. Each DNA sequence must have exactly the same number of characters. The only exception to this is that the data can have spaces (i.e. ' '), which are ignored.
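The format check and the character rules above can be sketched in a few lines of Python. This is only an illustration of the rules as they are described here, not the program's own code, and the names `guess_input_format` and `clean_sequence` are hypothetical.

```python
# A sketch of the rules described above; function names are hypothetical,
# and this is not taken from the SITES source code.
ALLOWED = set("AaGgCcTtNn-.*")

def guess_input_format(file_text):
    """Return 'PHYLIP' if the file starts with a digit, otherwise 'SITES'."""
    first = file_text[:1]
    return "PHYLIP" if first and first in "0123456789" else "SITES"

def clean_sequence(raw):
    """Drop spaces, replace disallowed characters with 'N', collect warnings."""
    cleaned, warnings = [], []
    for index, ch in enumerate(raw):
        if ch == " ":
            continue  # spaces are ignored
        if ch in ALLOWED:
            cleaned.append(ch)
        else:
            cleaned.append("N")
            warnings.append(f"character {ch!r} at offset {index} replaced by 'N'")
    return "".join(cleaned), warnings
```

Running `clean_sequence` on every sequence before building a data file makes it easy to confirm that all sequences end up with the same number of characters.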
Below is a sample SITES input file for 10 sequences, each of 50 characters. There is one noncoding region extending from base 20 to base 44, inclusive, and the first base of the coding region (i.e. base 1) is in frame 3 of the codon. There are four groups of sequences.
****
SITES sample input
10 50
3 1
20 44
4
simulans 3
mauritiana 2
sechellia 2
melanogaster 3
SI-CA1    CAGGGTGTCCGACTCGGCCTACTCGAGCA......GCAACAGCCAGTCAC
SI-CA2    CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCAC
SI-K1     CAGGGTGTCCGACTCGGCCTACTCGAACA......GTAACAGCCAGTCAC
MA-1      CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATAGCCAGTCGC
MA-2      CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATCGCCAGTCGC
SE-C1     CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGTCGC
SE-P1     CAAGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCGC
ME-NJ1    CAAGGTGTCCGACTCGGCCTACCCGAACGGCTGCAGCAACAGCCAGCCGC
ME-K1     CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC
ME-LI1    CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC
****
For many questions, intron and exon boundaries are not relevant. To run SITES without providing information on introns and exons, simply indicate that the entire sequence is one large noncoding sequence. In this case, the value of the first coding base is irrelevant. For example, to do this with the sample data set above, lines 3 and 4 of the data file would be:
1 1
1 50
PHYLIP sequential and interleaved formats:
_______________________
Running the Program
_______________________
The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. The user starts the program simply by going to the folder where the data file and the program exist and typing the name of the program (e.g. 'sites'). The program asks several questions about the data file and the desired analysis. Nearly all commands and options can also be entered using command line parameters.
On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered. The user can also just hit return at this point and the program will request runtime parameters.
_______________________
Menus
_______________________
Site Options Menu
-----------------------
If the 'S' flag is not used when the program is started, then the following menu will appear.
SITE OPTIONS ARE:
   a  for All site types
   i  for all noncoding base changes
   e  for all coding base changes
   s  for synonymous coding sites
   r  for replacement coding sites
   o  for ambiguous coding sites
   d  for insertion/deletion (indel) sites
   f  for only informative sites
   n  for only transitions
   v  for only transversions
   z  to skip all positions with more than 2 base values
   x  to skip all positions within indels
TYPE THE LETTERS OF THE DESIRED SITE TYPES (no spaces):
These options determine the information that is included in the polymorphism table, and they may affect which kinds of polymorphic sites are subject to other analyses. The user should enter a string of characters corresponding to those types of polymorphic sites that are to be included in the analyses. For example, an analysis on only informative synonymous sites would be called for by entering 'fs' or 'sf'. The most common analysis is done on all types of polymorphic sites (i.e. 'a'); 'ambiguous coding sites' are those for which the program could not determine whether a coding base change was a replacement or a synonymous change.
'z' will cause all those positions with more than two types of base values to be excluded from the analysis.
'x' will cause all those positions that are within regions for which some of the lines differ by indels to be excluded from the analyses.
By invoking both 'z' and 'x', the data set can be reduced to just those sites that fit a simple infinite sites model of mutation.
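The combined effect of 'z' and 'x' can be illustrated with a short Python sketch. It is only an approximation of what the options do: it treats any alignment column containing a gap character as being "within an indel", which is an assumption, and the function name `infinite_sites_columns` is hypothetical.

```python
# Illustrative only: approximates the 'z' and 'x' site options on an alignment.
GAP_CHARS = set("-.*")

def infinite_sites_columns(seqs):
    """Return 1-based positions that have no gaps and at most two base values."""
    kept = []
    for pos in range(len(seqs[0])):
        column = [s[pos].upper() for s in seqs]
        if any(c in GAP_CHARS for c in column):
            continue  # 'x': skip positions within indels (assumed: any gap in column)
        bases = {c for c in column if c != "N"}
        if len(bases) > 2:
            continue  # 'z': skip positions with more than 2 base values
        kept.append(pos + 1)
    return kept
```

For example, with the three sequences "ACGT-A", "ACTTAA" and "AGATAA", position 3 carries three bases and position 5 contains a gap, so only positions 1, 2, 4 and 6 are kept.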
Analysis Menu
-----------------------------
If the 'A' command is not used when the program is started, then the following menu will appear.
ANALYSIS TYPES ARE:
   a      for all basic types (except r,m,g,l)
   s      for basic site table
   e      for coding change details
   c      for codon usage tables
   p      for polymorphism analysis
   i      for indel site table & indels in recombination analysis
   r      for recombination analysis
   m      for population model fitting
   l (L)  for linkage disequilibrium analyses (invokes r)
   g      for GC analyses
TYPE THE LETTERS OF THE DESIRED ANALYSES (no spaces):
'a' calls for the most basic analyses, including s, e, c, p & i.
's' calls for a table of polymorphic base positions to appear in the output file.
'e' calls for a table that contains the codons that are the sites of synonymous, replacement, and ambiguous polymorphisms. This is especially useful for determining the likely sequence of mutations when multiple polymorphisms occur within a codon.
'c' calls for a table of codon usage. Only the 'universal' code is used. Two tables are printed, one for the first sequence in the data file, and one for all of the sequences. Also included are counts of the numbers of synonymous and replacement bases (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes). These counts of synonymous bases can be used to calculate the number of synonymous polymorphisms per synonymous base. Similarly, the counts of replacement bases can be used to calculate the number of replacement polymorphisms per replacement base.
'p' calls for a set of analyses on polymorphism within and among groups.
'i' calls for a polymorphism table for indel variation. This option also causes indels to be included in some recombination analyses, if the 'r' option is also used.
'r' calls for a set of recombination analyses.
'm' calls for fitting the ISOLATION SPECIATION MODEL to pairs of groups of sequences, and the POPULATION SIZE CHANGE MODEL to each group of sequences. These analyses are not applicable for many data sets, and they can be slow for data sets with large groups.
'l' (L) calls for the Linkage Disequilibrium Options Menu to appear.
'g' calls for a table of GC content by codon position and for noncoding regions, for each of the sequences.
Data and Output Menu
-----------------------------------
If the 'O' command is not used when the program is started, then the following menu will appear. Several of these commands require additional information that must be entered in response to queries, or provided with other command line flags.
DATA OPTIONS:
   g  for comparisons within groups
   n  for new group designations
   d  for dropping some sites
   m  for data limited to some sites
   x  for dropping some sequences
   c  for non-canonical genetic code
OUTPUT OPTIONS:
   s  for no screen output during analysis
   p  suppress large pairwise difference table
   h  text line of input values for HKA program
   w  text line of input values for WH program
   f  for first sequence reference in site table
   t  for table style: (. and -) replace (- and *)
TYPE THE LETTERS OF THE DESIRED OPTIONS (no spaces):
DROPPING SOME SITES, enter option:
   r  for range
   f  for file
   i  for individual entry
The user should enter 'r', 'f', or 'i', and follow the directions. In the case of 'f', the user should previously have created a file that contains the base positions that are to be dropped. The format is simply a list of numbers, one number per line. In the case of 'r' or 'i', the program asks the user for the range limits or the individual base positions, respectively.
Linkage Disequilibrium Options Menu
-------------------------------------------
If the 'L' command is not used on the command line, and the 'L' option is listed in response to the Analysis Menu, then the following menu will appear. Some of these commands require additional information that must be entered in response to queries, or provided with other command line flags.
Linkage Disequilibrium (LD) Analyses:
   m  print matrix of pairwise values and significance tests
   y  measure and compare average LD by regions
   t  randomization tests of average LD by regions
   s  analyze LD among polymorphism shared between groups
   x  exclude singletons from all LD analyses
Apply Analyses to the Following LD Measures:
   d  D - standard linkage disequilibrium (default)
   p  D' - D prime = D/Dmax
   b  |D| - absolute value of D
   a  |D'| - absolute value of D prime
   r  correlation coefficient
   q  r^2 - squared correlation coefficient
TYPE THE LETTERS OF DESIRED ANALYSES AND MEASURES (no spaces):
_______________________
Command line usage
_______________________
Nearly all commands and options can be given at the command line. Usage of the command line permits many analyses to be automated, so that they can be very easily repeated and, if desired, given in batch files.
Each command line parameter flag may be upper or lower case and may be preceded by a '-', '\' or '/'. Following a command line parameter flag is a string of characters appropriate to that command.
Flag   Parameter string                           Example
----   ----------------------------------------   ----------------
I      the name of the data file                  -Imydata.seq
R      the name of the output file                -Rmyresult
M      message up to 78 characters (no blanks)    -Mmy_data
S      kinds of polymorphic sites to analyze      -Sa (see Site Options Menu above)
A      kinds of analyses to perform               -Aspr (see Analysis Menu above)
O      data and output options                    -Ogc (see Data and Output Menu above)
L      linkage disequilibrium options             -Lyrp (see LD Analysis Menu)
C      alternate genetic code                     -Cf (see Alternative Genetic Code Menu above)
Any of these parameters can be given at the command line when the program is started. Either uppercase or lowercase letters may be used for flags and for their parameter strings. For example, the following command line, entered in a command prompt window, will generate a file called myresult.sit.
sites -isitestestdata -rmyresult -mmy_data -sa -aspr -ogc -cf
For each of the parameter flags that is not included at the command line when the program is started, the program will ask for the information. It is not required that any of the command line parameter flags be used. The program will also ask for more information if PHYLIP style data formats are used, and if some of the analysis options are used. The data file can have any name, though it should not include the folder information. This usually means that the program and the data need to reside in the same folder. The name of the output file should not have an extension. The characters '.SIT' will be added to the output file name that is given to the program. Thus, for the example above, the program would produce a file called 'myresult.sit'.
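For repeated or batch runs, the command line can be assembled by a small script. The following Python sketch uses only the flags listed in the table above; the function names are hypothetical, and it assumes that the program can be started as 'sites' from the folder that holds the data file. Options that make the program ask for further information will still cause it to wait for input.

```python
import subprocess

def sites_command(data_file, out_base, message, sites, analyses, options):
    """Build a SITES command line from the documented I, R, M, S, A and O flags."""
    if " " in message or len(message) > 78:
        raise ValueError("message must have no blanks and at most 78 characters")
    if "." in out_base:
        raise ValueError("output name should not have an extension ('.SIT' is added)")
    return [
        "sites",
        f"-i{data_file}",
        f"-r{out_base}",
        f"-m{message}",
        f"-s{sites}",
        f"-a{analyses}",
        f"-o{options}",
    ]

def run_sites(command):
    """Run one analysis; raises CalledProcessError if the program reports failure."""
    subprocess.run(command, check=True)
```

Calling `sites_command("sitestestdata", "myresult", "my_data", "a", "spr", "gc")` reproduces the example command line above without the '-cf' flag.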
Three of the flags ('S', 'A', and 'O') correspond to menus (see above). If one of these flags is not used at the command line, its menu will appear, requesting user input.
Additional command line options can be used to limit the size of the data set if the 'O' flag or the 'L' flag is used. These secondary flags should be placed after the respective primary flag.
Note that command line flags for dropping sequences can be used in conjunction either with flags for excluding base positions or with flags for keeping certain base positions.
_______________________
_______________________
When running, the program writes some messages to the screen (unless the 's' OUTPUT OPTION is used). While reading in the data, the program writes the base position of each polymorphic site as it is found. Following this, the program writes a brief message for each category of analysis being conducted.
The output of the program is all contained in one file that has a '.SIT' extension. This file can be quite long for large data sets and complete analyses. If there are many sequences, the 'p' option in the OUTPUT OPTIONS menu can cut down on the size of the output file.
For the most part, the formatting of the output is not hard to follow. At the top of the file, the run parameters are listed, as are the sequence and group names. For a full analysis of a data set with multiple groups, the following headings are generated:
-------------------------------
This table provides the base position and type of polymorphism for all variable base positions in the data set (excluding those sequences and regions of sequences that were not included in the analysis).
-----------------------
An approximate list of counts of the number of base positions associated with each kind of polymorphism in the table. These counts are rough because some base positions may have multiple kinds of polymorphisms.
--------------
A set of two tables: The first is very similar to that for polymorphic sites, except that each distinct indel (regardless of length) gets just one position. The second table lists the sequence and position of each distinct indel.
Table of Synonymous, Replacement and Uncertain Exon Changes
--------------------------------------------------------------------------
A list of all (or most) coding region base changes. With each polymorphism the different codon states are shown. This can be used to resolve cases where multiple changes have occurred within the same codon, and where it was not clear to the program whether a change was synonymous or replacement. This analysis may not be complete if there are three or four bases segregating at a position, or in cases where a single aligned codon has many amino acids segregating.
---------------------
A table of codon usage for the first sequence. Also included are counts of the numbers of synonymous and replacement sites (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes). The calculation of these values is best explained with an example. Consider a codon (e.g. AAA for lysine) and consider each base in the codon. For each base, the fraction of all mutations that will change the codon in such a way that the amino acid does not change is counted. For AAA, 0 of the three possible mutations at the first base will lead to a synonymous change, similarly 0 for the second base, and 1/3 of the mutations for the third base (because an A->G change leads to AAG, which is also lysine). So the number of synonymous sites in an AAA codon is 1/3. The number of replacement sites is 3 - 1/3 = 2 2/3. Every codon gets a score this way, and the final tally is just the sum of scores for all codons.
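The AAA example can be reproduced with a short Python sketch. It uses the standard ('universal') genetic code and counts a mutation to a stop codon as a replacement change; that treatment of stop codons is an assumption made here, since the text above does not say how they are scored. The function name is hypothetical.

```python
from fractions import Fraction

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def synonymous_sites(codon):
    """Sum, over the three bases, of the fraction of mutations that are synonymous."""
    codon = codon.upper()
    amino_acid = CODE[codon]
    total = Fraction(0)
    for pos in range(3):
        same = sum(1 for b in BASES
                   if b != codon[pos]
                   and CODE[codon[:pos] + b + codon[pos + 1:]] == amino_acid)
        total += Fraction(same, 3)
    return total

syn = synonymous_sites("AAA")   # 1/3 synonymous sites
rep = 3 - syn                   # 8/3, i.e. 2 2/3 replacement sites
```

Summing `synonymous_sites` over every codon of a sequence gives the kind of tally described above.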
Codon Usage Table For All Lines
-------------------------------------
A table of codon usage for all sequences and counts of synonymous and replacement sites summed across all sequences.
--------------------------
Most analyses are applied only to those types of sites that are specified in the SITE OPTIONS menu, but there are exceptions - see below.
BASE PAIR COMPARISONS - not including N's or indels - counts of the number of bases compared and the number of differences for all pairs of sequences. Essentially a distance matrix. The counts of the numbers of bases compared are based on the entire sequence, or a shorter region, depending on the 'x' SITE OPTION, and the 'l' and 'm' ANALYSIS OPTIONS. The counts of the number of bases compared are not reduced by other SITE OPTIONS choices. However, the counts of site differences are affected by SITE OPTION choices. This table is not generated if the 'p' option is used in the OUTPUT OPTIONS.
GROUP DIFFERENCES - not including N's or indels - a matrix with group by group comparisons. Above and on the diagonal are the average pairwise differences for those sites specified under SITE OPTIONS. Below the diagonal is the net average pairwise divergence (e.g. Nei, 1987, p.276).
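As an illustration of the quantities in this matrix, the Python sketch below computes average pairwise differences within and between two groups, and the net divergence as the between-group average minus the mean of the two within-group averages. That is the usual definition of net divergence; it is assumed here that SITES follows it, and the function names are hypothetical.

```python
from itertools import combinations, product

def pairwise_diff(a, b):
    """Count differing positions, skipping any position with an N or a gap."""
    return sum(1 for x, y in zip(a.upper(), b.upper())
               if x in "ACGT" and y in "ACGT" and x != y)

def mean_within(group):
    """Average pairwise difference among sequences of one group."""
    pairs = list(combinations(group, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def mean_between(group1, group2):
    """Average pairwise difference between sequences of two groups."""
    pairs = list(product(group1, group2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)

def net_divergence(group1, group2):
    """Between-group average minus the mean of the within-group averages."""
    return mean_between(group1, group2) - (mean_within(group1) + mean_within(group2)) / 2
```

For two groups ["AAAA", "AAAT"] and ["TTAA", "TTAA"], the within-group averages are 1 and 0, the between-group average is 2.5, and the net divergence is 2.0.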
GROUP DIFFERENCES PER BASE PAIR - same as above but numbers are per base pair. Note that the divisor is the average number of base pairs compared, calculated from the same numbers above the diagonal in the BASE PAIR COMPARISONS table. Thus these per base pair measures may be misleading for some SITE OPTIONS choices. For example, if only synonymous sites are analyzed (option 's' under SITE OPTIONS), a per base pair measure of divergence can instead be obtained by dividing the numbers in the GROUP DIFFERENCES matrix by the estimated number of synonymous sites, which is given beneath the CODON USAGE TABLE.
FIXED DIFFERENCES - A fixed difference is a polymorphic site at which all of the sequences of one group are different from all of the sequences of a second group.
SHARED POLYMORPHISMS - A shared polymorphism is a polymorphic site at which each of two groups of sequences is found to have at least two of the same bases.
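These two definitions translate directly into set operations. The Python sketch below classifies a single alignment column for a pair of groups; ignoring N's and gap characters is an assumption made for the illustration, and the function name and the labels it returns are hypothetical.

```python
def classify_site(column1, column2):
    """Classify one site given the bases carried by two groups of sequences."""
    bases1 = {c for c in column1.upper() if c in "ACGT"}
    bases2 = {c for c in column2.upper() if c in "ACGT"}
    if not bases1 or not bases2:
        return None                      # no usable data in one of the groups
    if len(bases1 | bases2) == 1:
        return "invariant"
    if bases1.isdisjoint(bases2):
        return "fixed"                   # every sequence differs between groups
    if len(bases1 & bases2) >= 2:
        return "shared"                  # both groups carry the same two bases
    return "other"                       # polymorphic, but neither fixed nor shared
```

For example, `classify_site("AAA", "TT")` gives "fixed" and `classify_site("AAT", "TA")` gives "shared".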
Fst AND POPULATION MIGRATION RATES - Fst values between pairs of populations, and estimates of Nm, assuming diploidy (i.e. N is the effective number of diploid individuals), calculated according to equation 4 (except using a factor of 1/4) of Hudson et al., 1992. This estimate should be multiplied by 4/3 to get the corresponding number for an X linked locus. It should be multiplied by 2 to get the corresponding number for a haploid locus. Also, if the locus is haploid and sex-limited (e.g. mitochondria), the estimate should be multiplied by 2, and then it applies only to the number of individuals of the sex that carry the locus.
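The conversion from Fst to Nm and the adjustments for other modes of inheritance can be sketched as follows. The sketch assumes that the estimate has the form Nm = (1/Fst - 1)/4, which is how the reference to equation 4 with a factor of 1/4 is read here; the multipliers come from the paragraph above, and the function name and locus labels are hypothetical.

```python
# Multipliers from the text: 4/3 for an X linked locus, 2 for a haploid locus.
SCALE = {"autosomal": 1.0, "x_linked": 4.0 / 3.0, "haploid": 2.0}

def nm_from_fst(fst, locus="autosomal"):
    """Estimate Nm from Fst, assuming Nm = (1/Fst - 1)/4 for a diploid autosomal locus."""
    if fst <= 0:
        return None  # the estimate is undefined when Fst is zero or negative
    return (1.0 / fst - 1.0) / 4.0 * SCALE[locus]
```

With Fst = 0.2 this gives Nm = 1 for an autosomal locus and Nm = 2 for a haploid locus.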
POLYMORPHIC SITE FREQUENCIES PER GROUP FOLDED & ROOTED - These tables give the counts of the number of lines that carry a polymorphism of a certain frequency. For example, a site in which two sequences are different from the remaining n-2 sequences is counted in either category 2 or category n-2 (depending on whether the distribution is folded or, if it is not, on the value of the root sequence). There are two tables, one for the folded distribution, which shows only the frequencies for the rarest bases, and one for the rooted distribution. A rooted distribution is not shown if the analysis includes only one group. The table for the rooted distribution is based on an outgroup sequence chosen from one of the other sequence groups. The method for picking outgroups is very simple. The outgroup sequence for one group is the first sequence listed in the most divergent other group. This may not be an ideal outgroup for various reasons. Some properties of these distributions are known for some models, and the distributions are useful for considering questions about changing population size and natural selection. See papers by Tajima and also by Fu.
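A folded distribution of this kind can be computed for one group with a few lines of Python. The sketch below skips columns that contain N's or gaps and columns with more than two bases; those choices are assumptions for the illustration and may differ from what the program does. The function name is hypothetical.

```python
from collections import Counter

def folded_site_frequencies(seqs):
    """Return a list whose entry k is the number of sites whose rarer base occurs k times."""
    n = len(seqs)
    spectrum = [0] * (n // 2 + 1)
    for column in zip(*seqs):
        column = [c.upper() for c in column]
        if any(c not in "ACGT" for c in column):
            continue                      # skip N's and gap characters
        counts = Counter(column)
        if len(counts) != 2:
            continue                      # monomorphic, or more than two bases
        spectrum[min(counts.values())] += 1
    return spectrum
```

For the four sequences "AAT", "AAT", "ATA" and "TTA", the first site has one rare base and the other two sites have two, so the result is [0, 1, 2].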
SITES WITH MORE THAN TWO BASES SEGREGATING - A number of analyses assume an infinite sites model, under which a polymorphic site is caused by exactly one mutation and there can be no more than two bases segregating. This table lists those sites that are clearly not consistent with this assumption. If desired, the program can be rerun using data option 'd' to drop these sites, or site option 'z' to skip all positions with more than two base values.
D STATISTICS - These indices are measures of departure from a neutral Wright-Fisher model. See Tajima (1989) and Fu and Li (1993). These statistics rely on a count of the number of polymorphic sites. The counts that are used come directly from the site frequency distribution. However, these counts will underestimate the actual number of mutations, particularly if some sites are segregating more than two bases within a sequence sample group. For Fu and Li's D, an outgroup sequence is picked from among the other groups in the data set (if any occur - see above for POLYMORPHIC SITE FREQUENCIES). Note that the outgroup sequence that is picked may not be ideal, depending on the divergence among groups. Also given are counts of the different classes of mutations, as defined by Fu and Li.
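For reference, Tajima's D can be computed from the number of sequences, the number of polymorphic sites, and the average number of pairwise differences. The Python sketch below follows the published formula of Tajima (1989); it is not taken from the SITES source code, and the function name is hypothetical.

```python
import math

def tajimas_d(n, num_sites, pi):
    """Tajima's D from n sequences, the count of polymorphic sites, and pi."""
    if n < 4 or num_sites == 0:
        return None  # the variance term is zero or undefined in these cases
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * num_sites + e2 * num_sites * (num_sites - 1)
    return (pi - num_sites / a1) / math.sqrt(variance)
```

D is zero when the two estimates of 4Nu agree, and negative when there is an excess of low frequency polymorphisms relative to the neutral expectation.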
THETA (4Nu) ESTIMATES - Two different estimates of the neutral mutation parameter 4Nu: Watterson's estimator (Watterson, 1975) and nucleotide diversity, or pi. See, for example, Hudson (1990) or Tajima (1993). Also listed are the number of sequences for each group and the number of bases, which is calculated by taking the average of the number of bases compared in the pairwise comparison matrix.
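Both estimators are simple to compute for one group of aligned sequences. In the Python sketch below, columns containing N's or gap characters are skipped entirely, which is an assumption made for the illustration; the values are per sequenced region, not per base pair, and the function names are hypothetical.

```python
from itertools import combinations

def usable_columns(seqs):
    """Alignment columns in which every sequence carries A, C, G or T."""
    columns = [[c.upper() for c in col] for col in zip(*seqs)]
    return [col for col in columns if all(c in "ACGT" for c in col)]

def watterson_theta(seqs):
    """Number of polymorphic sites divided by the sum of 1/i for i = 1..n-1."""
    n = len(seqs)
    num_sites = sum(1 for col in usable_columns(seqs) if len(set(col)) > 1)
    a1 = sum(1.0 / i for i in range(1, n))
    return num_sites / a1

def nucleotide_diversity(seqs):
    """Average number of differences over all pairs of sequences (pi)."""
    columns = usable_columns(seqs)
    pairs = list(combinations(range(len(seqs)), 2))
    diffs = sum(1 for i, j in pairs for col in columns if col[i] != col[j])
    return diffs / len(pairs)
```

For the four sequences "AAT", "AAT", "ATA" and "TTA", there are 3 polymorphic sites, so Watterson's estimate is 3/(11/6) = 18/11, and pi is 11/6.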
Historical Population Model Fitting
----------------------------------------
This section provides the results of the fitting of specific historical models to the data. These models are described in Wakeley and Hey (1997). The methods work best for data sets for which the models are a close fit to reality, and for which there are many polymorphisms and lots of recombination. For most data sets, the models do not fit very well, and the fitting methods are not able to return a reasonable solution.
-----------------
POPULATION RECOMBINATION PARAMETER (4Nc) ESTIMATES - Two estimates of 4Nc, the population recombination rate, are generated: gamma (Hey & Wakeley, 1997) and Hud4Nc (Hudson 1987). c is the crossing over rate for the sequenced region per generation. These estimates assume diploidy, and must be adjusted otherwise. Also listed is the estimated ratio of the number of recombination events per mutation (c/u). This is calculated simply by dividing gamma by Theta (i.e. estimated 4Nc divided by estimated 4Nu).
A very important assumption of these methods is that just a single mutation has caused each polymorphism within a group. gamma in particular is quite sensitive to a violation of this infinite sites assumption. Hud4Nc may not be as sensitive; however, this estimator suffers from extreme bias (estimates tend to be far too high) if the numbers of sequences or polymorphisms are low (Hudson, 1987).
CONGRUENCY AND INTERVAL TABLES - For each group, two tables are printed out.
o TABLE OF INFORMATIVE SITES - This is a matrix that I sometimes find useful for thinking about recombination. It lists all informative polymorphic sites and the values carried by each sequence. A site is considered informative if there are at least two bases that occur in at least two of the sequences. 'rarecount' is the number of times the less common base occurs. The matrix tells whether each pair of sites is congruent. Two sites are congruent if together they do not reveal all four gametic types. For example, a G-C polymorphism and an A-T polymorphism would be congruent if there occur individuals of just three of the possible gametic types (e.g. GT, GA, and CT, but not CA). Under the infinite sites model, all four types can only occur if there has been crossing over or gene conversion. The matrix has a '*' or a '+' at pairs of congruent sites, and a blank for incongruent pairs. Each site with itself gets an 'S'. Thus a completely congruent data set would be all S's and *'s. The more recombination, the more there is a tendency for blocks of *'s and +'s to be broken up. It can be a handy visual aid, but is unwieldy with too many polymorphisms. It also assumes an infinite sites mutation model. The matrix can be enormous if there are a great many polymorphisms. The program will only print up to 245 columns (i.e. 245 informative polymorphic sites). The matrix does not deal well with sites that exhibit more than 2 bases in a group. One of the bases gets arbitrarily lumped with one of the others.
o MINIMAL SET OF RECOMBINATION INTERVALS - This table lists the smallest set of locations of recombination events in the history of the sample. See Hudson and Kaplan (1985) for full details. Under the model, each interval must bracket at least one recombination event.
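The congruency test used for the TABLE OF INFORMATIVE SITES is the familiar four-gamete check, and it can be sketched in Python as follows. Each site is given as a string holding the base carried by each sequence of a group. The sketch marks every congruent pair with '*' only, since the distinction between '*' and '+' is not described above, and the function names are hypothetical.

```python
def congruent(site_a, site_b):
    """Two sites are congruent if fewer than four gametic types are observed."""
    gametes = set(zip(site_a, site_b))
    return len(gametes) < 4

def congruency_matrix(sites):
    """Rows of the matrix: 'S' on the diagonal, '*' if congruent, blank otherwise."""
    rows = []
    for i, a in enumerate(sites):
        row = ""
        for j, b in enumerate(sites):
            if i == j:
                row += "S"
            else:
                row += "*" if congruent(a, b) else " "
        rows.append(row)
    return rows
```

For the three sites "GGCC", "TATT" and "TATA", the first and third sites show all four gametic types (GT, GA, CT and CA), so that pair is left blank while the other pairs are marked as congruent.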
Linkage Disequilibrium Analyses
-------------------------------------
AVERAGE LINKAGE DISEQUILIBRIUM BETWEEN REGIONS - If the 'y' or 't' option is listed in the LD Menu, then an analysis is conducted of the mean LD between regions of the sequence. These regions are listed in a separate file, the name of which is either given on the command line ('-G' flag) or requested by the program. For each region, and each pair of regions, the mean LD is calculated between all pairs of sites. These values are listed in a matrix that also shows, separately for each group, how many sites could be compared for each mean (if there are fewer than 10 pairs of sites, the mean is not shown).
A different table is generated for every LD measure that was called for (LD MENU or '-L' command line flag).
If the 't' option is listed in the LD Menu, then a randomization test is conducted of the mean LD values. For each randomization, mean LD is calculated within each region and between each pair of regions, and the simulated mean values are compared to the actual values. For randomizations within regions, the base values for each polymorphic site within a group are randomly shuffled among the sequences of that group. For randomizations between regions, the complete sequences for each region, within each group, are randomly shuffled so that the sequences remain intact but their linkage with respect to other regions is randomized.
A total of 1000 LD region randomizations are done. If screen output is not suppressed, the number of simulations is listed on the screen. These tests can take seconds or days depending on the size of the data set and the speed of the computer, so it is best not to turn off screen output.
The test results are shown in the tables of means, with symbols to indicate the level and direction of significant departure from random.
MATRICES OF LINKAGE DISEQUILIBRIUM VALUES AND TESTS - These matrices will be large if there are many polymorphic sites within a group. They are printed out in a separate file, the name of which is either given on the command line ('-E' flag) or is requested by the program in response to the 'm' option in the LD OPTION MENU. For each group there is a matrix of tests of association. Fisher's Exact Test is shown above the diagonal, and a chi-square test is shown below the diagonal.
For each group and each LD measure requested, a matrix is given that shows the value of LD between all pairs of sites for which values could be calculated. Values are shown to just 1 figure so that the matrix fits in a manageable space.
Following each matrix is given the mean LD value for that matrix.
LD ANALYSES AMONG SHARED POLYMORPHISMS - A series of matrices is listed, each containing LD calculations involving shared polymorphisms between species. These can be useful in trying to figure out whether shared polymorphisms are due to gene flow or due to shared ancestry.
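The LD measures named in the LD menu have standard definitions for a pair of sites that each carry two bases. The Python sketch below computes them from the bases carried by each sequence at the two sites. It takes the base carried by the first sequence as the reference allele at each site, which is an arbitrary choice made here; the sign of D, D' and r depends on that choice. The function name is hypothetical.

```python
import math

def ld_measures(site1, site2):
    """D, D', |D|, |D'|, r and r^2 for two sites given as strings of bases."""
    n = len(site1)
    ref1, ref2 = site1[0], site2[0]          # arbitrary reference alleles
    p1 = sum(x == ref1 for x in site1) / n
    p2 = sum(y == ref2 for y in site2) / n
    p12 = sum(x == ref1 and y == ref2 for x, y in zip(site1, site2)) / n
    d = p12 - p1 * p2
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    d_prime = d / d_max if d_max > 0 else 0.0   # D' = D/Dmax
    denom = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    r = d / denom if denom > 0 else 0.0
    return {"D": d, "D'": d_prime, "|D|": abs(d), "|D'|": abs(d_prime),
            "r": r, "r^2": r * r}
```

For two sites in complete association, such as "AATT" and "CCGG", D is 0.25 and both D' and r^2 are 1; for "AATT" and "CGCG", D is 0.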
-----------------------
A table of the number of bases at each codon position, and the GC percentage for each DNA sequence.
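A table of this kind can be illustrated with a small Python sketch. It treats the whole sequence as coding, takes the codon position of the first base as an argument (in the same sense as the frame value in the input file), and skips N's and gap characters when counting; these simplifications are assumptions, and the function name is hypothetical.

```python
def gc_by_codon_position(seq, first_base_frame=1):
    """Return (number of bases, GC percentage) for codon positions 1, 2 and 3."""
    counts = {1: [0, 0], 2: [0, 0], 3: [0, 0]}   # position -> [GC count, total]
    for index, base in enumerate(seq.upper()):
        if base not in "ACGT":
            continue                              # N's and gaps are not counted
        position = (index + first_base_frame - 1) % 3 + 1
        counts[position][1] += 1
        if base in "GC":
            counts[position][0] += 1
    return {pos: (total, 100.0 * gc / total if total else None)
            for pos, (gc, total) in counts.items()}
```

For the sequence "GCAGCA" starting in frame 1, positions 1 and 2 are 100 percent GC and position 3 is 0 percent GC.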
Formatted Values for Input to HKA Program
--------------------------------------------------
The HKA program carries out the HKA test (Hudson et al., 1987) for multiple loci. It can be conducted for pairs of species or for one species and a single outgroup sequence. HKA program input lines are generated for all possible pairs of groups.
Formatted Values for Input to WH Program
------------------------------------------------
The WH program carries out the analyses described in Wakeley and Hey (1997) and in Wang, Wakeley and Hey (1997). WH input lines are generated for all possible pairs of species.
_______________________
Program Limitations
_______________________
The program requires that the DNA sequences be aligned prior to analysis. SITES does not carry out any phylogenetic analysis. It does not estimate trees, and it does not attempt any assessment of multiple hits. In general, the situations for which SITES is intended are ones in which there have not been many cases of multiple mutations at base positions and where there have not been so many insertion/deletion events that alignment is difficult.
The program can handle large amounts of data, depending on available RAM. If there is sufficient RAM then the distributed version of the program can handle up to 200 sequences, each of 20,000 base pairs, and up to 20 groups. These values can be increased by changing constants in sites.h and recompiling the program.
The program is quick for basic polymorphism analyses. Recombination, Population Model Fitting, and Linkage Disequilibrium analyses can be time consuming for large data sets.
If you have a data set that crashes or hangs the program, you can send it to me with a description of the problem. Depending on time, I'll try to find the problem.
_______________________
Literature Cited
_______________________
Fu, Y. X., and W. H. Li, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693-709.
Hey, J., and J. Wakeley, 1997 A coalescent estimator of the population recombination rate. Genetics 145: 833-846.
Hedrick, P. W., 2000 Genetics of Populations. Jones and Bartlett, Sudbury, MA.
Hudson, R. R., 1987 Estimating the recombination parameter of a finite population model without selection. Genet. Res. Camb. 50: 245-250.
Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, Vol. 7, edited by P. H. Harvey and L. Partridge. Oxford University Press, New York.
Hudson, R. R., and N. L. Kaplan, 1985 Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147-164.
Hudson, R. R., M. Kreitman and M. Aguadé, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159.
Hudson, R. R., M. Slatkin and W. P. Maddison, 1992 Estimation of levels of gene flow from DNA sequence data. Genetics 132: 583-589.
Kliman, R. M., and J. Hey, 1993 DNA sequence variation at the period locus within and among species of the Drosophila melanogaster complex. Genetics 133: 375-387.
Nei, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York.
Schaeffer, S. W., and E. L. Miller, 1992 Estimates of gene flow in Drosophila pseudoobscura determined from nucleotide sequence analysis of the alcohol dehydrogenase region. Genetics 132: 471-480.
Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.
Tajima, F., 1993 Measurement of DNA polymorphism, pp. 37-59 in Mechanisms of Molecular Evolution, edited by N. Takahata and A. G. Clark. Sinauer Associates, Sunderland, MA.
Watterson, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-275.
Wakeley, J., and J. Hey, 1997 Estimating ancestral population parameters. Genetics 145: 847-855.
Wang, R. L., J. Wakeley and J. Hey, 1997 Gene flow and natural selection in the origin of Drosophila pseudoobscura and close relatives. Genetics 147: 1091-1106.
This page last changed November 10, 2015