Jody Hey           Evolutionary Genetics Laboratory

 Professor    -    Department of Genetics    -    Rutgers University

Hey Lab Research Publications Software, Data Contacts, People

**SITES** -- DNA POLYMORPHISM ANALYSIS PROGRAM --

 

DOCUMENTATION  

Jody Hey 

Department of Genetics

Rutgers University

Nelson Biological Labs

604 Allison Rd.

Piscataway, NJ  08854-8082

732-445-5272

fax 732-445-5870

hey@biology.rutgers.edu

http://lifesci.rutgers.edu/~heylab

 

* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it. 

_______________________

 Contents

_______________________

______________________

 Overview   

_______________________

 

SITES is a computer program for the analysis of comparative DNA sequence data.  Basic analyses include: data summaries by polymorphism class;  polymorphism estimates within and between groups (species); estimates of migration, neutral model, and recombination parameters; and linkage disequilibrium analyses.  SITES is primarily intended for data sets with multiple closely related sequences. It is especially useful when multiple sequences have been obtained from each of one or several closely related populations or species.  

 

SITES can handle large data sets, and is flexible with regard to the input data.  With a few  commands on the command line, or in response to menus, one can tailor a particular data set in different ways to suit different questions, including working on only a subset of the data and regrouping the data. 

 

If  you find the need to mention the program in a publication,  you may cite the following  reference which mentions the program:  

Hey, J and J. Wakeley. 1997. A coalescent estimator of the population  recombination rate.  GENETICS 145: 833-846.

______________________

 

Downloadable Files              Return to Contents

______________________

 

_______________________

 

 New Features                    Return to Contents

_______________________

 

The major recent addition, as of 2001, is a suite of  linkage disequilibrium analyses.

SITES will also generate lines suitable for input to the HKA  and WH computer programs. 

SITES will also now handle alternative genetic codes.

Note to previous users - some command line flags have changed - check MENUS.. 

_______________________

 

 Analyses 

_______________________

 

  • Polymorphism Table SITES identifies polymorphic sites and generates a table summarizing polymorphism information. Polymorphisms are characterized with regard to several categories: Noncoding or Coding; Synonymous, Replacement or Ambiguous; Transversion or Transition; Insertion/Deletion (Indels)

 

  • Indel Table SITES tries to identify the boundaries of insertions/deletions (indels)  and generates a polymorphism table with its best guess of indel variation.

 

  • Codon Usage Tables (assuming the standard genetic code)

 

  • Numbers of synonymous and replacement base positions. These can be interpreted as the relative proportions of random mutations that would be expected to cause synonymous or replacement changes.

 

  • Numbers of pairwise differences among sequences.

 

  • GC content for complete sequences and each codon position.

 

  • Polymorphism Analyses  SITES conducts several kinds of polymorphism analysis that can be  applied to the entire data set or to multiple subsets of sequences.
    • The program generates the two most commonly used estimates of the neutral mutation parameter Theta (or 4Nu, where N is the effective population size and u is the neutral mutation rate):
      • pairwise nucleotide diversity or pi (Nei, 1987 p256).
      • Wattersons etimator (Watterson, 1975).
    • The program determines several measures of non-neutrality in the site frequency distribution, including Tajima's D (Tajima, 1989), Fu and Li's D and Fu and Li's D* (Fu and Li, 1993).
    • The program determines the site frequency distribution for each group, both rooted and unrooted.

 

  • Group Comparisons.  SITES works with groups of sequences, so that groups can be compared  with one another. A number of analyses are intended explicitly for group comparisons.

 

    •  The program determines the average number of pairwise  differences among all pairs of groups, as well as the net divergence  among all pairs of groups.
    • determines the numbers of shared  and fixed differences among groups.
    • estimates the population migration parameter (Nm) and Fst for all pairs of groups, following the method outlined in of Hudson et al., 1992.

 

  • Historical Population Model Fitting SITES carries out the analyses described in Wakeley &  Hey (1997, Estimating ancestral population parameters.Genetics 145, 847-855). The data are fit to two different models.
    • an isolation speciation model, with four parameters
    • a model of recent population size change, with three parameters

 

  • Recombination analyses. SITES conducts several kinds of recombination analysis on each group, or the entire data set.
    • a table of site by site congruency
    • a table of the minimum set of recombination intervals (Hudson & Kaplan, 1985)
    • Hey & Wakeley's (1997) gamma, an estimate of 4Nc, the population recombination rate.
    • Hudson's (1987) estimate of 4Nc

 

  • Linkage Disequilibrium analyses. SITES will conduct several analyses of Linkage Disequilibrium (LD)
    • matrices of LD values, for several measure of LD
    • matrices of statistical tests of LD values
    • calculates mean LD within and between regions of the sequence
    • does simulations to test mean LD within and between regions
    • determines various measures of LD among polymorphisms shared by groups

 

  • Data Subsets and ReGrouping The data set can be partitioned in numerous ways, without changing the data file.
    • the groupings of sequences can be changed at runtime
    • sequences can be dropped from the analysis at runtime
    • analyses can be limited to specific base pairs or specific intervals
    • specific base positions can be dropped from the analysis.
    • analyses can be limited to certain categories of polymorphic sites

 

_______________________

 

 Input File Format                Return to Contents

_______________________

 

There are three kinds of input format. SITES format, PHYLIP sequential format and PHYLIP interleaved format. PHYLIP is Joe Felsenstein's package of phylogenetic programs. The first line of a PHYLIP format file begins with two integers. SITES will check to see if the first character of a file is a number (i.e., 0..9) and if so it will assume that the file is in PHYLIP format. If the first character is not a number it will assume a SITES format. SITES format is very similar to PHYLIP sequential format. If multiple analyses may be run, it is most efficient to set up a SITES format file.

 

Data Restrictions.

 

The only characters that are allowed in the sequences are  'A','a','G','g','C','c','T','t','N','n','-','.', and '*'.   'N' and 'n'    represent base positions where the sequence is not known. '-','*', and  '.' represent base positions where one sequence has a gap relative to another sequence. Other characters cause the program to insert an 'N' in their place, and to display a warning at runtime. Each DNA sequence must have exactly the same number of characters. The only exception to this is that the data can have spaces (i.e. ' ') which are ignored.

 

SITES format:

  • Line 1. A line of text that generally provides a brief description of the data set. This line must not begin with a number. If desired, up to 10 extra lines of commentary can be added following line 1.  Each additional line of commentary must have a '#' at the very beginning of the line. All comment lines will be reproduced in the output file.
  • Line 2. Contains 2 integers. The first is the number of sequences. The second is the number of characters in each sequence. 
  • Line 3. Contains 2 integers. The first integer can be 1,2 or 3, and is the position in the coding frame of the first coding base. If there are no coding bases, then this position can be any number The second integer is the number of noncoding stretches of sequence. For instance if the DNA sequence contains a 5' non- coding sequence followed by three amino acid coding regions - with intervening introns- followed by a 3' non-coding region. Then this integer would be 4 (5' noncoding plus 2 introns plus 3' noncoding).  If the sequence does not include a protein coding region or if the coding frame of the first coding base is not known, then the entire sequence should be specified as non-coding, and the second integer should be 1.
  • After line 3 there is one line for each noncoding stretch indicated on line 3. Each line contains two integers: the base position of the first base in the noncoding region, and the base position of the last base in the region. The lines go in order from 5' to 3', meaning that the first line after line 3 contains the bases for the most 5' noncoding region.  If there are no coding regions in the sequence, then the second number on line 3 should be 1 and the  two integers on line 4 should simply be a 1, followed by the length of the sequences (i.e. indicating a noncoding region as long as the  sequence).
  • The line after those containing information on noncoding regions contains just a single integer, the number of groups of sequences. If sequences are not in groups then this number should be 1. If this number is greater than 1, then for each group there is a line containing the group name and the number of sequences in that group. A group name can be up to 12 characters. The group names should be ordered so that they are in the same order as the sequences.
  • After the lines containing the number of groups and the group names and sizes, come the sequences. The sequences must be in sequential format, meaning that all of the text for one sequence occurs before the text for the next sequence. There can be one sequence per line, however this is not necessary, and sequences can be spread out over multiple lines.  Only the following characters can be included in the sequence: A,a,C,c,G,c,T,t,N,n,'*','-','.'.  The latter three all indicated the absence of sequence relative to other sequences in the file. 
  • For each sequence, the first 10 characters contain the name of the sequence for that line. There may be spaces between the name and the beginning of the sequence, but the actual sequence must begin in column 11. The sequence name should begin with a letter or a number (no asterisks, or other symbols can be the first character in a sequence name).

 

Sample SITES Format

Below is a sample SITES input file for 10 sequences each of 50 characters.  There is one noncoding region extending between bases 20  to 44, inclusive,  and the first base of the coding region (i.e. base 1) is in frame 3 of the  codon. There are four groups of sequences.

 

****

SITES sample input

10 50

3 1

20 44

4

simulans 3

mauritiana 2

sechellia 2

melanogaster 3

SI-CA1    CAGGGTGTCCGACTCGGCCTACTCGAGCA......GCAACAGCCAGTCAC

SI-CA2    CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCAC

SI-K1     CAGGGTGTCCGACTCGGCCTACTCGAACA......GTAACAGCCAGTCAC

MA-1      CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATAGCCAGTCGC

MA-2      CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATCGCCAGTCGC

SE-C1     CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGTCGC

SE-P1     CAAGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCGC

ME-NJ1    CAAGGTGTCCGACTCGGCCTACCCGAACGGCTGCAGCAACAGCCAGCCGC

ME-K1     CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC

ME-LI1    CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC

****

For many questions, intron and exon boundaries are not relevant. To run  SITES without providing information on introns and exons, simply indicate that the entire sequence is one large noncoding sequence.  In this case, the  value of the first coding base is irrelevant.  For example to do this with  the sample data set above, lines 3 and 4 of the data file would be:

1   1

1   50

 

PHYLIP sequential and interleaved formats:

  • line 1: contains 2 integers. The first is the number of sequences. The second is the number of bases per sequence.
  • After line 1, are lines for the data. The data can be in sequential format with 1 line per sequence. If so then the first 10 positions are devoted to the name of the sequence. The data can also be in interleaved format. If so then for N sequences, there are N lines each beginning with a name in the first 10 positions followed by some sequence. After N lines there is an empty line, followed by N more lines containing aligned data (but no names). This pattern is repeated until the final block of N lines.

_______________________

 

 Running the Program              Return to Contents

_______________________

 

The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. The user starts the program simply by going to the folder where the data file and the program exist and typing the name of the program (e.g. 'sites'). The program asks several questions about the data file and the desired analysis. Nearly all commands and options can also be entered using command line parameters. 

 

On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered.  The user can also just hit return at this point and the program will request runtime parameters.

 

_______________________

 

 Menus                           Return to Contents

_______________________

 

     Site choice menu.

     -----------------------

 

If the 'S' flag is not used when the program is started, then the following menu will appear.

 

SITE OPTIONS ARE : a for All site types

                   i for all noncoding base changes

                   e for all coding base changes

                   s for synonymous coding sites

                   r for replacement coding sites

                   o for  ambiguous coding sites

                   d for insertion/deletion (indel) sites

                   f for only informative sites

                   n for only transitions

                   v for only transversions

 

                   z to skip all positions with more than 2 base values

                   x to skip all positions within indels

TYPE THE LETTERS OF THE DESIRED SITE TYPES (no spaces):

 

These options determine the information that is included in the polymorphism table, and they may affect which kinds of polymorphic sites are subject to other analyses. The user should enter a string of characters corresponding to those types of polymorphic sites that are to be included in the analyses. For example, an analysis on only informative synonymous sites would be called for by entering 'fs' or 'sf'. The most common analysis is done on all types of polymorphic sites (i.e. 'a'); 'ambiguous coding sites' are those for which the program could not determine whether a coding base change was a replacement or a synonymous change. 

'z' will cause all those positions with more than two types of base values to be excluded from the analysis.

'x' will cause all those positions that are within regions for which some of the lines differ by indels to be excluded from the analyses.

By invoking both 'z' and 'x', the data set can be reduced to just those sites that fit a simple infinite sites model of mutation.

 

     Analysis choice menu.

     -----------------------------

 

If the 'A' command is not used when the program is started, then the following menu will appear.

 

ANALYSIS TYPES ARE: 

        a for all basic types (except r,m,g,l)

        s for basic site table

        e for coding change details

        c for codon usage tables

        p for polymorphism analysis

        i for indel site table & indels in recombination analysis

 

        r for recombination analysis

        m for population model fitting

        l (L) for linkage disequilibrium analyses (invokes r)

        g for GC analyses

TYPE THE LETTERS OF THE DESIRED ANALYSES (no spaces):

 

·        'a' calls for the most basic analyses, including s,e,c,p & i

·        's' calls for a table of polymorphic base positions to appear in the output file.

·        'e' calls for a table that contains the codons that are the sites of synonymous, replacement, and ambiguous polymorphisms. This is especially useful for determining the likely sequence of mutations when multiple polymorphisms occur within a codon.

·        'c' calls for a table of codon usage. Only the 'universal' code is used. Two tables are printed, one for the first sequence in the data file, and one for all of the sequences. Also included are counts of the numbers of synonymous and replacement bases (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes). These counts of synonymous and replacement bases can be used to calculate the number of synonymous polymorphisms per synonymous base. Similarly, the counts of replacement bases can be used to calculate the number of replacement polymorphisms per replacement base.

·        'p' calls for a set of analyses on polymorphism within and among groups.

·        'i' calls for a polymorphism table for indel variation. This option also causes indels to be included in some recombination analyses, if the 'r' option is also used.

·        'r' calls for a set of recombination analyses.

·        'm' calls for fitting the ISOLATION SPECIATION MODEL to pairs of groups of sequences, and the POPULATION SIZE CHANGE MODEL to each group of sequences. These analyses are not applicable for many data sets, and they can be slow for data sets with large groups.

·        'l' (‘L’) calls for the Linkage Disequilibrium Options Menu to appear.

·        'g' calls for a table of GC content by codon position and for noncoding regions, for each of the sequences.

 

     Data And Output Options Menu

     -----------------------------------

 

If the 'O' command is not used when the program is started, then the following menu will appear. Several of these commands require additional information that must be entered in response to queries, or provided with other command line flags.

 

DATA OPTIONS:   g for comparisons within groups

                n for new group designations

                d for dropping some sites

                m for data limited to some sites

                x for dropping some sequences

                c for non-canonical genetic code

OUTPUT OPTIONS: s for no screen output during analysis

                p suppress large pairwise difference table

                h text line of input values for HKA program

                w text line of input values for WH program

                f for first sequence reference in site table

                t for table style: (. and -) replace (- and *)

TYPE THE LETTERS OF THE DESIRED OPTIONS (no spaces):

 

  • 'g' means that polymorphism and recombination analyses will be performed on each group of sequences individually. If 'g' is not entered then the entire data set is treated as a single group. This option is ignored if there are no groups in the data set.
  • 'n' provides the opportunity to redesignate group assignments. These new groupings must still be consistent with the order of the DNA sequences in the data file. A series of questions will prompt the user to provide the necessary information.
  • 'd' provides the opportunity to drop some of the base positions from inclusion in the analysis. A series of questions will prompt the user to provide the necessary information. There are three ways to drop base positions, chosen in response to the following query that will appear on the screen.

 

DROPPING SOME SITES, enter option : r for range

                                    f for file

                                    i for individual entry

The user should enter 'r','f', or 'i', and follow the directions. In the case of 'f', the user should previously have created a file that contains base positions that are to be dropped. The format is simply a list of numbers, one number per line.  In the case of 'r' or 'i', the program asks the user for information on the range limits, or individual base positions, respectively.

 

  • 'm' provides the opportunity to limit the analyses to just a subset of the base positions. This option works almost exactly like 'd', but can be more convenient under some circumstances. By using the 'm' option it is possible to examine just short sequences. For example, if a sliding window analysis is desired, the program can be run repeatedly using a different interval in each case. If just coding, or non-coding sequence is desired, then the 'd' or 'm' options are not needed. Instead simply specify 'e' or 'i' in the Site Choice menu.
  • 'x' provides the opportunity to exclude some of the sequences from the analyses. This options works exactly as the 'd' option -  either a range or a file or individual entry can be used to indicate which sequences to exclude. This option allows for the use of very large data sets. One large data set can be maintained for a set of aligned sequences, and then specific analyses that use just a subset of the sequences can still be run. The program will maintain the appropriate group designations, even if one or more groups are entirely excluded. 
  • 'c' calls for an alternative genetic code to be used.  If this option is used then the following  menu appears that lists the most frequently used alternative codes. Selection of one of the alternatives causes that code to be used in all codon related analyses. The codes are given at the NCBI site http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
    • Alternate (Non-Canonical) Genetic Code:
      • V    Vertebrate Mitochondrial Code
      • Y    Yeast Mitochondrial Code
      • M    Mold, Protozoan, and Coelenterate Mitochondrial Code
      • I      The Invertebrate Mitochondrial Code
      • C     The Ciliate, Dasycladacean and Hexamita Nuclear Code
      • E     The Echinoderm Mitochondrial Code
      • P     The Euplotid Nuclear Code
      • A     The Alternative Yeast Nuclear Code
      • D     Ascidian Mitochondrial Code
      • F     Flatworm Mitochondrial Code
    • TYPE THE LETTER OF THE DESIRED GENETIC CODE:
  • 's' no information is output to the screen during the run.
  • 'p' suppresses the output of the matrix of pairwise differences between individual sequences. This matrix can be useful for a close look at things, but with many sequences it is quite large.
  • 'h' causes the generation of text that is suitable for input into a program called HKA.  This program does HKA tests (Hudson, Kreitman and Aguadé, 1987) and related analyses on multiple loci.
  • 'w' causes the generation of text that is suitable for input into a program called WH.  This program does analyses described in Wakeley and Hey(1997) and Wang, Wakeley & Hey (1997).
  • 'f' calls for the polymorphism table to use the first DNA sequence in the data file as the reference. The default is to use a consensus sequence (i.e. the most common base among all of the sequences) as the reference sequence in the polymorphism table output.
  • 't' changes the characters that are used in the polymorphism table. The default is for '-' to refer to a base value that is identical to the reference sequence for a given polymorphism and to use '*' to refer to a gap in the DNA sequence relative to other DNA sequences, or relative to the reference. By using the 't' option, the table will use '.' to refer to a base that is identical to the reference and '-' to refer to gaps.

 

     Linkage Disequilibrium Options Menu

     -------------------------------------------

 

If the 'L' command is not used on the command line, and the 'L' option is listed in response to the Analysis Choice Menu then the following menu  will appear. Some of these commands require additional information that must be entered in response to queries, or provided with other command line flags.

 

Linkage Disequilibrium (LD) Analyses:

         m print matrix of pairwise values and significance tests

         y measure and compare average LD by regions

         t randomization tests of average LD by regions

         s analyze LD among polymorphism shared between groups

         x exclude singletons from all LD analyses

 

Apply Analyses to the Following LD Measures

         d  D - standard linkage disequilibrium (default)

         p  D' -  D prime = D/Dmax

         b  |D| - absolute value of D

         a  |D'| - absolute value of D prime

         r  correlation coefficient

         q  r^2 - squared correlation coefficient

 

TYPE THE LETTERS OF DESIRED ANALYSES AND MEASURES (no spaces):

 

  • 'm' causes matrices of LD values to be printed in an additional output file.  That filename is either given at the command line ('-E' flag) or is requested following the appearance of the Linkage Disequilibrium menu.  For every group there is a matrix of statistical tests, including Fisher's Exact Test and a chi-square test. Also for every group, and every LD measure, there will be a matrix showing LD values for all pairs of sites.
  • 'y' causes an analysis of mean LD values within and between regions of the sequence.  These regions must be given in a file, the name of which is either given on the command line ('-G' flag) or is requested after this menu.  For an analysis with R regions the file should have R numbers, in order from low to hi, with each number being the last base of a region.  The very last number should be identical to the full length of the sequence.
  • 't' causes statistical tests, via simulation, of mean LD values within and between regions. The simulations can be slow (see below).
  • 's' causes analysis of LD among polymorphisms shared between species.
  • 'x' causes all LD analyses to be limited to only those pairs of sites for which neither polymorphism is of the singleton type (i.e. the rarer base appears only once). LD measures vary in the sensitivity to the presence of singletons.
  • 'd','p','b','a','r','q'  These measures are conventional (see e.g. Hedrick 2000).  For each one that is used, all of the LD analyses are done.  Quite a lot of output can be generated by calling for multiple LD measures.

 

_______________________

 

Command line usage             Return to Contents

_______________________

 

Nearly all commands and options can be given at the command line.  Usage of the command line permits many analyses to be automated, so that they can be very easily repeated and, if desired, given in batch files. 

 

Primary Command Line Flags:

 

Each command line parameter flag may be upper or lower case and may be preceded by a '-', '\' or '/'. Following a command line parameter is a string  of characters appropriate to that command.

 

Flag  parameter string                            example

----------------------------------            ----------------

I    the name of the data file                -Imydata.seq

R    the name of the output file              -Rmyresult

M    message up to 78 characters (no blanks)  -Mmy_data

S    kinds of polymorphic sites to analyze    -Sa   (see Site Options Menu above)

A    kinds of analyses to perform             -Aspr (see Analysis Menu above)

O    data and output options                  -Ogc  (see Data and Ouput Menu above)

L    linkage disequilibrium options           -Lyrp (see LD Analysis Menu)

C    alternate genetic code                   -Cf   (see Alternative Genetic Code Menu above)

 

Any of these parameters can be given at the command line when the program is started. Either uppercase or lowercase letters may be used for flags and for their parameter strings.  For example, the following command line, entered in a command prompt window will generate a file called myresult.sit.

 

sites -isitestestdata -rmyresult -mmy_data -sa -aspr -ogc -cf

 

For each of the parameter flags that are not included at the command line  when the program is started, the program will ask for the information. It is  not required that any of the command line parameter flags be used. The program  will also ask for more information if phylip style data formats are used, and  if some of the analysis options are used. The data file can have any name,  though it should not include the folder information. This usually means  that the program and the data need to reside in the same folder. The name of the output file should not have an extension. The characters '.SIT' will be  added to the output file name that is given to the program. Thus, for the  example above, the program would produce a file called 'myresult.sit'.

 

Three of the flags ('S', 'A', and 'O') correspond to menus (see below).  If the flag is not used at the command line, a menu will appear, requesting  user input.

 

Secondary Command Line Flags:

 

Additional command line options can be used to limit the size of the data set if the 'O' flag or the ‘L’ flag is used.  These secondary flags should be placed after the respective primary flag. 

 

Note that command line flags for dropping sequences can be used either in conjunction with flags for excluding base positions, or with flags for keeping  certain base positions.

 

  • If the 'O' flag (see Primary Command Line Flags) is on the command line  with the 'd' option, in order to exclude some base positions from the  analyses (see Analysis Options 'd' below):
    • 'D' flag can be included on the command line to specify the filename with base positions to be dropped. Immediately following the '-D' should be the name of the existing file. For example: '-Dbasedrop.txt'.
    • or an 'X' flag can be included on the command line to specify the range of base positions to be excluded. Immediately following the '-X' should be a range with a '-' separating the two extremes, and without any spaces.  For example: '-X100-200'.
  • If the 'O' flag is on the command line with the 'm' option, in order to  limit the analyses to just a subset of base positions  (see Analysis Options 'm' below):
    •  a 'K' flag can be included on the command line to specify the filename with the base positions to be used in the analysis. Immediately following the '-K' should be the name of the  existing file.  For example: '-Kbasekeep.txt'   
    • or a 'Y' flag can be included on the command line to specify the range of base positions to be kept. Immediately following the '-Y' should be a range with a '-' separating the two extremes, and without any spaces.  For example: '-Y201-300'.
  • If the 'O' flag is on the command line with the 'x' option, in order to  drop some sequences from the analyses  (see Analysis Options 'x' below):
    •  a 'Q' flag can be included on the command line to specify the filename with the sequence numbers to be dropped. Immediately following the '-Q' should be the name of the  existing file.  For example: '-Qdropseq.txt'.
    • or a 'Z' flag can be included on the command line to specify the range of sequences to be dropped. Immediately following the '-Z' should be a range with a '-' separating the two extremes, and without any spaces.  For example: '-Z4-12'.

 

  • If the 'L' flag is in the command line with the 'm' option, the program  the program will print out LD matrices to another file.  The name of that  file can  be identified at runtime, or else the program will ask for the filename.
    • an'E' flag can be included on the command line to specify the filename to which the LD matrices will be printed. Any existing file of that name will be overwritten. Immediately following the '-E' should be the name of the file. For example –Emyldmatrices
  • If the 'L' flag is in the command line with  the 'y' option, the program will generate an analyses of LD by regions  of the sequence.  To do this it needs input on the boundaries of regions.  These are contained in another file, which can be identified  at runtime, or else the program will ask for the filename.
    •  a 'G' flag can be included on the command line to specify the filename that contains the boundaries of regions for measuring and comparing mean LD. Immediately following the '-G' should be the name of the existing file. For an analysis with r regions the file should have r numbers, in order from low to hi, with each number being the last base of a region.  The very last number should be identical to the full length of the sequence.

 

_______________________

 

Output                          Return to Contents

_______________________

 

When running, the program writes some messages to the screen (unless the 's' ANALYSIS OPTION is used). While reading in the data, the program writes  the base position of each polymorphic site as it is found. Following this, the program writes a brief message for each category of analysis being conducted.

 

The output of the program is all contained in one file that has a '.SIT' extension. This file can be quite long for large data sets and complete analyses. If there are many sequences, the 'p' option in the OUPUT OPTIONS menu can cut down on the size of the data set.

 

For the most part, the formatting of the output is not hard to follow. At the top of the file, the run parameters are listed, as are the sequence and group names. For a full analysis of a data set with multiple groups,  the following headings are generated:

 

     Table of Polymorphic Sites

     -------------------------------

This table provides the base position and type of polymorphism for all variable base positions in the data set (excluding those sequences and regions of sequences that were not included in the analysis).

 

     Counts of Site Types

     -----------------------

An approximate list of counts of the number of base positions associated with each kind of polymorphisms in the table. These counts are rough because some base positions may have multiple kinds of polymorphisms.

 

     Indel Tables

     --------------

A set of two tables:  The first is very similar to that for polymorphic sites, except that each distinct indel (regardless of length) gets just one position. The second table lists the sequence and position of each distinct indel.

 

     Table of Synonymous, Replacement and Uncertain Exon Changes

     --------------------------------------------------------------------------

A list of all (or most) coding region base changes. With each polymorphism the different codon states are shown. This can be used to resolve cases where multiple changes have occurred within the same codon, and where it was not clear to the program whether a change was synonymous or replacement.  This analysis may not be complete if there are three or four bases segregating at a position, or in cases where a single aligned codon has many amino acids segregating.

 

     Codon Usage Table

     ---------------------

A table of codon usage for the first sequence. Also included are counts of the numbers of synonymous and replacement sites (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes).  The calculation of these values is best explained with an example. Consider a  codon (e.g. AAA for lysine) and consider each base in the  codon.  For each base, the fraction of all mutations that  will change the codon in such a way that the amino acid does  not change is counted. For AAA, 0 of the three possible  mutations at the first base will lead to a synonymous change, similarly 0 for the second base, and 1/3 of the mutations for  the third base (because an A-> G change leads to a AAG which  is also lysine).  So the number of synonymous sites in an AAA  codon is 1/3. The number of replacement sites is 3-1/3 = 2 2/3. Every codon gets a score this way, and the final tally is just the sum of scores for all codons.

 

   Codon Usage Table For All Lines

   -------------------------------------

A table of codon usage for all sequences and counts of synonymous and replacement sites summed across all sequences.

 

     Polymorphism Analyses

     --------------------------

Most analyses are applied only to those types of sites that  are specified in the SITES OPTIONS menu, but there are exceptions - see below.     

·        BASE PAIR COMPARISONS - not including N's or indels - counts of the number of bases compared and the number of  differences for all pairs of sequences. Essentially a  distance matrix. The counts of the numbers of bases compared are based on the entire sequence, or a shorter region, depending on the 'x' SITE OPTION, and the 'l' and  'm' ANALYSIS OPTIONS.  The counts of the number of bases  compared are not reduced by other SITE OPTIONS choices.  However, the counts of site differences are affected by SITE OPTION choices. This table is not generated if the 'p' option is used in the OUTPUT OPTIONS.

·        GROUP DIFFERENCES - not including N's or indels -a matrix with group by group comparisons. Above and on the  diagonal are the average pairwise differences for those  sites specified under SITE OPTIONS. Below the diagonal is the net average pairwise divergence (e.g. Nei, 1987, p.276).

·        GROUP DIFFERENCES PER BASE PAIR - same as above but numbers are per base pair. Note that the divisor is the average # of base pairs compared, calculated from the same numbers above the diagonal in the BASE PAIRS COMPARISON table. Thus these per base pair measures may be misleading for some SITE OPTIONS  choices. For example, if only synonymous sites are analyzed (option 's' under SITE OPTIONS), a per base pair measure of divergence can be obtained by dividing numbers in the GROUP DIFFERENCES matrix, by the estimated number of synonymous sites, that are given beneath the CODON USAGE TABLE.

·         FIXED DIFFERENCES  - A fixed difference is a polymorphic site at which all of  the sequences of one group are different from all of the sequences of a second group.

·        SHARED POLYMORPHISMS  - A shared polymorphism is a polymorphic site at which each of two groups of sequences are found to have at least two of the same bases.

·         Fst AND POPULATION MIGRATION RATES - Fst values, between pairs of populations, and estimates of  Nm, assuming diploidy (i.e. N is the effective number of diploid individuals) calculated according to  equation 4 (except using a factor of 1/4) of Hudson et  al., 1992. This estimate should be multiplied by 4/3  to get the corresponding number for an X linked locus. It  should be multiplied by 2 to get the corresponding number for a haploid locus. Also if the locus is haploid and  sex-limited (e.g. mitochondria), the estimate should be  multiplied by 2, and then it applies only to the number  of individuals of the sex that carry the locus.

·        POLYMORPHIC SITE FREQUENCIES PER GROUP – FOLDED  & ROOTED -These tables give the counts of the number of lines that carry a polymorphism of a certain frequency. For example, a site in which two sequences are different from the remaining n-2 sequences is counted in either category 2 or category n-2  (depending on whether the distribution is folded, or if  not depending on the value of the root sequence.  There  are two tables, one for the folded distribution which  shows only the frequencies for the rarest bases, and one for the rooted distribution.  A rooted distribution is not shown if the analysis includes only one group. The table for the rooted  distribution is based on an outgroup sequence chosen from one of the other sequence groups.  The method for picking  outgroups is very simple. The outgroup sequence for one  group is the first sequence listed in the most divergent other group.  This may not be an ideal outgroup for various  reasons. Some properties of these distributions are known for  some models and the distributions are useful for considering questions about changing population size and natural selection. See papers by Tajima and also by Fu.

·        SITES WITH MORE THAN TWO BASES SEGREGATING - A number of analyses assume an infinite sites model, under which a polymorphic site is caused by exactly one mutation and there can be no more than two bases segregating. This table lists those sites that are clearly not consistent with this assumption. If desired, the program can be rerun using analysis option 'd' to exclude these sites.

·         D STATISTICS  - These indices are measures of departure from a neutral Fisher-Wright model. See Tajima (1989) and Fu and Li (1993).  These statistics rely on a count of the number  of polymorphic sites.  The counts that are used come directly from the site frequency distribution. However  these counts will underestimate the actual number of  mutations, particularly if some sites are segregating more than two bases within a sequence sample group. For  Fu and Li D, an outgroup sequence is picked from among  the other groups in the data set (if any occur - see above for POLYMORPHIC SITE FREQUENCIES).  Note, the outgroup sequence that is picked may not be ideal, depending on the divergence among  groups. Also given are counts of the different classes of mutations, as defined by Fu and Li.

·        THETA (4Nu) ESTIMATES - Two different estimates of the neutral mutation parameter 4Nu: Watterson’s and nucleotide diversity or pi.  See, for example, Hudson (1990) or Tajima (1993). Also listed are the number of sequences for each group and the number of bases, which is calculated by taking the average of the number of bases compared in the  pairwise comparison matrix.

      

   Historical Population Model Fitting

   ----------------------------------------

This section provides the results of the fitting of specific  historical models to the data. These models are described in Wakeley and Hey (1997). The methods work best for data  sets for which the models are a close fit to reality, and for which there are many polymorphisms and lots of  recombination. For most data sets, the models do not fit very well, and the fitting methods are not able to return a reasonable solution.   

  • ISOLATION SPECIATION MODEL FITTING  
    • NOTE - with the distribution of the WH program, this routine is now out of date. An improved Isolation Model Fitting analysis can be done by the WH program.
    • This is an analysis in which, for each group of sequences, a  specific speciation model is fit to the data.  A table is  printed in which the observed numbers of exclusive,  shared, and fixed polymorphisms are given for each pair of sequence groups.  Also given are the parameter estimates  under the Isolation Speciation Model, and the expected  values of the polymorphism types under that model. The  model is described in detail in Wakeley and Hey (1997).  The main assumptions are that two species, each with  constant population sizes, underwent speciation at a  single time point, and that prior to that time the common ancestor also had a constant population size. The model  has four parameters: Theta1 (the neutral mutation  parameter 4N_1u for group 1); Theta2 (4N_2u for group 2);  ThetaA (4N_Au for the ancestral species); and Tau (2Tu,  where T is the number of generations since speciation.  Note that Tau/Theta1 = T/(2N_1) the time in units of 2N_1 generations). This analysis does not really check to see  if the model fits the data. It simply provides estimates  assuming the model is true.  There are several situations in which the parameters either cannot be estimated or the  estimates cannot be trusted.  No estimates are made if  there are either zero shared polymorphisms or zero fixed  differences. Also there are a variety of patterns that can  arise in the data and for which suitable parameters cannot  be found.  In general these can be identified because the  expected values of the observations will not equal the  observations.  The program will point this pattern out,  if it occurs. One common data pattern that causes this  is if the shared polymorphisms are high relative to the  exclusive or fixed classes.  This is a reasonable  occurrence if the sequence groups have had migration  between them, but it is very unlikely if the Isolation  Speciation Model is true. This analysis works best on data sets that have lots of  polymorphism and lots of recombination.  The effect of  recombination is to drop, considerably, the variance of  the observed numbers of polymorphisms, so that if the  model is true, the observed  numbers will tend to be  close to those expected of the true  parameters. Without  recombination, the high variance of  the observed  polymorphism classes can cause a poor fit  of the model,  or extreme parameter estimates, even if the model is true. 
  • POPULATION SIZE CHANGE MODEL FITTING - This is an analysis in which, for each group of sequences, a  specific model of a change in population size is fit to  the data.  The table in the results file contains for  each group, the observed numbers of polymorphisms in  classes S_1, S_2 and S_3.  In general S_i = the numbers of polymorphisms in frequency class i,7-i (except when the  total sample size is 6 when S_i is the numbers for class  i,6-i).  When a group has more than 6 sequences, the full  site frequency distribution (see POLYMORPHIC SITE  FREQUENCIES PER GROUP) is reduced to the expected values  for just these three classes as if the samples where of  size 7.  The use of 3 classes and a sample size of 7 is  for convenience in estimating the three parameters under  the model (Wakeley and Hey, 1997).  The model assumes that there was an ancestral constant population size, a time of change of population size, followed by a new recent  constant population size. The three parameters are Theta  (recent 4Nu), ThetaA (ancestral 4N_Au) and Tau (2Tu, where  T is the number of generations since speciation. Note that  Tau/Theta1 = T/(2N) the time in units of 2N generations). The model, and the model fitting, have much in common with the ISOLATION SPECIATION MODEL.  The same cautions about  polymorphism and recombination apply.  The more of both,  the better. It turns out that there are many values for  S_2 and S_3 (relative to S_1) that are simply not  permitted under the model. For example it is not possible to have an expected value of either S_2 or S_3 that is  greater than S_1, and so if this observation occurs, the  model cannot be fit. 

 

     Recombination

     -----------------

 

·        POPULATION RECOMBINATION PARAMETER (4Nc) ESTIMATES  - Two estimates of 4Nc the population recombination rate are generated: gamma (Hey & Wakeley, 1997) and Hud4Nc (Hudson 1987).  c is the crossing over rate for the sequenced region  per generation. These estimates assume diploidy, and  must be adjusted otherwise. Also listed is the estimated  ratio of the number of recombination events per mutation  (c/u). This is calculated simply by dividing gamma by Theta  (i.e. estimated 4Nc, divided by estimated 4Nu).

 

A very important assumption of these methods is that just a  single mutation has caused each polymorphism within a group. gamma in particular is quite sensitive to a violation of  this infinite sites assumption.  Hud4Nc may not be as  sensitive, however this estimator suffers from extreme  bias (estimates tend to be way too high) if the numbers of sequences or polymorphisms are low (Hudson, 1987).

 

·        CONGRUENCY AND INTERVAL TABLES   For each group, two tables are printed out.

o       TABLE OF INFORMATIVE SITES. - This is a matrix that I sometimes find useful for thinking about recombination. It lists all informative polymorphic sites and the values carried by each sequence. A site is considered informative if there are at least two bases that occur in at least two of the sequences. 'rarecount' is the number of times the less common base occurs. The matrix tells whether each pair of sites is congruent. Two sites are congruent if together they do not reveal all four gametic types. For example a G-C polymorphism and an A-T polymorphism would be congruent if there occur individuals of just three of the possible gametic types (e.g. GT, GA, and CT, but not CA). Under the infinite sites model all four types can only occur if there has been crossing over or gene conversion. The matrix has a '*' or a '+' at pairs of congruent sites, and a blank for incongruent pairs. Each site with itself gets an 'S'. Thus a completely congruent data set would be all S's and *'s. The more recombination the more there is a tendency for blocks of *'s and +'s to be broken up. It can be a handy visual aid, but is unwieldy with too many polymorphisms. It also assumes an infinite sites mutation model.  The matrix can be enormous if there are  a great many polymorphisms.  The program will only print  up to 245 columns (i.e. 245 informative polymorphic  sites).  The matrix does not deal well with sites that exhibit more than 2 bases in a group. One of the bases gets arbitrarily lumped with one of the others.

 

o       MINIMAL SET OF RECOMBINATION INTERVALS  - This table lists the smallest set of locations of  recombination events in the history of the sample.   See Hudson and Kaplan (1987) for full details.  Under the  model, each interval must bracket at least one  recombination event.

 

     Linkage Disequilibrium Analyses

     -------------------------------------  

·        AVERAGE LINKAGE DISEQUILIBRIUM BETWEEN REGIONS   - If the 'y' or 't' option is listed in the LD Menu, then an analysis is conducted of the mean LD between regions of the sequence.  These regions are listed in a separate file, the name of which is either given on the command line (-G flag) or requested.  For each region, and each pair of regions the mean LD is calculated between all pairs of sites. These values are listed in a matrix that also shows separately for each group how many sites could be compared for each mean  (if there are less than 10 pairs of sites, the mean is not shown).

 

A different table is generated for every LD measure that was called for (LD MENU or '-L' command line flag).

 

If the 't' option is listed in the LD Menu, then a randomization test is conducted of the mean LD values. For each randomization mean LD is calculated within and between  each region, and the simulated mean values are compared to the  actual values. For randomizations within regions, the base values for each  polymorphic site within a group are randomly shuffled among  the sequences of that group.  For randomizations between regions the complete sequences for each region, within each group, are randomly shuffled so that the sequences remain intact but they linkage with respect to other regions is randomized.

 

A total of 1000 LD region randomizations are done. If screen  output is not suppressed the number of simulations are listed on the screen. These tests can take seconds or days depending on the size of the data set and the speed of the computer, so it is best not to turn of screen output.

 

the test results are shown in the tables of means, with symbols to indicate the level and direction of significant departure  from random. 

 

·        MATRICES OF LINKAGE DISEQUILIBRIUM VALUES AND TESTS -These matrices will be large if there are many polymorphic sites within a group. They are printed out in a separate file, the name of which is either given on the command line ('-E' flag) or is requested by the program in response to the 'm' option in the LD    OPTION MENU.  For each group there is a matrix of tests of association. Fisher's Exact Test is shown above the diagonal, and a chi-square test is shown below the diagonal.

 

For each group and each LD measure requested, a matrix is given that shows the value of LD between all pairs of sites for which values could be calculated.  Values are shown to just 1 figure so as to have the matrix in manageable space.

 

Following each matrix is given the mean LD value for that matrix.

 

·        LD ANALYSES AMONG  SHARED  POLYMORPHISMS  - A series of matrices are listed, each containing LD calculations involving shared polymorphisms between species. These can be useful in trying to figure out whether shared polymorphisms are due to gene flow or due to shared ancestry.

 

     GC Content Analysis

     -----------------------

A table of the number of bases at each codon position, and the  GC percentage for each DNA sequence.

 

     Formatted Values for Input to HKA Program

     --------------------------------------------------

The HKA program carries out the HKA test (Hudson et al., 1987) for multiple loci. It can be conducted for pairs of species or for one species and a single outgroup sequence.  HKA program input lines are generated for all possible pairs of groups.

 

     Formatted Values for Input to WH Program

     ------------------------------------------------

The WH program carries out the analyses described in Wakeley and Hey  (1997) and in Wang, Wakeley and Hey (1997). WH input lines are generated for all possible pairs of species.

 

_______________________

 

 Program Limitations             Return to Contents

_______________________

 

The program requires that the DNA sequences be aligned prior to analysis. SITES does not carry out any phylogenetic analysis. It does not estimate trees, and it does not endeavor any assessment of multiple hits.  In general the situations for which SITES is intended are ones in which there have not  been many cases of multiple mutations at base positions and where there have  not been so many insertion/deletion events that alignment is difficult.

 

The program can handle large amounts of data, depending on available RAM.  If there is sufficient RAM then the distributed version of the program  can handle up to 200 sequences, each of 20,000 base pairs, and up to 20 groups.  These values can be increased by changing constants in sites.h and recompiling the program.  

 

The program is quick for basic polymorphism analyses. Recombination,  Population Model Fitting, and Linkage Disequilibrium analyses can be time consuming for large data sets.  

 

If you have a data set that crashes or hangs the program, you can send it to me with a description of the problem. Depending on time, I'll try and  find the problem.

 

_______________________

 

 Literature Cited                Return to Contents

_______________________

 

 

Fu, Y. X., and W. H. Li, 1993 Statistical tests of neutrality of mutations.

     Genetics 133: 693-709.

 

Hey, J. and J. Wakeley. 1997. A coalescent estimator of the population

     recombination rate.  Genetics 145, 833-846.

 

Hedrick, P. W., 2000   Genetics of Populations. Jones and Bartlett, Sudbury, MA.

 

Hudson, R. R., 1987 Estimating the recombination parameter of a finite

     population model without selection. Genet. Res. Camb. 50: 245-250.

 

Hudson,R.R. 1990 Gene genealogies and the coalescent process. In: Oxford

     Surveys in Evolutionary Biology. Vol. 7. (Eds: Harvey,PH; Partridge,L)

     Oxford University Press, New York, (1-44)

 

Hudson, R. R., and N. L. KAPLAN, 1985 Statistical properties of the number of

     recombination events in the history of a sample of DNA sequences.

     Genetics 111: 147-164.

 

Hudson, R. R., M. Kreitman and M. Aguadé, 1987 A test of neutral molecular

      evolution based on nucleotide data. Genetics 116: 153-159.

 

Hudson,R.R., M.Slatkin and W.P. Maddison 1992 Estimation of levels of gene

     flow from DNA sequence data. Genetics 132: 583-589.

 

Kliman,R.M. and J. Hey 1993 DNA sequence variation at the period locus within

     and among species of the Drosophila melanogaster complex. Genetics 133:

     375-387.

 

Nei,M 1987. Molecular Evolutionary Genetics. Columbia University

      Press, New York.

 

Schaeffer,S.W. and E.L. Miller 1992 Estimates of gene flow in Drosophila

     pseudoobscura determined from nucleotide sequence analysis of the

     alcohol dehydrogenase region. Genetics 132: 471-480.

 

Tajima, F., 1989 Statistical method for testing the neutral mutation

     hypothesis by DNA polymorphism. Genetics 123: 585-595.

 

Tajima,F 1993 Measurement of DNA polymorphism. In: Mechanisms of Molecular

     Evolution. (Eds: Takahata,N; Clarke,AG) Sinauer Associates, Sunderland,

     MA, 37-59.

 

Watterson, G. A., 1975 On the number of segregating sites in genetical models

     without recombination. Theor. Pop. Biol. 7: 256-275.

 

Wakeley, J. and J. Hey. 1997 Estimating ancestral population parameters.

      Genetics 145, 847-855.

 

Wang, R. L., J. Wakeley and J. Hey, 1997 Gene flow and natural selection

      in the origin of Drosophila pseudoobscura and close relatives.

      Genetics 147: 1091-106.

 

This page last changed February 04, 2004