FPG
-- A COMPUTER PROGRAM FOR FORWARD POPULATION GENETIC SIMULATION --
DOCUMENTATION
Jody Hey
Department of Genetics
Rutgers University
Nelson Biological Labs
604 Allison Rd.
Piscataway, NJ 08854-8082
732-445-5272
fax 732-445-5870
https://ccgg.temple.edu/heylab/home
* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.
_______________________
FPG (for Forward Population Genetic simulation) simulates a population of constant size that is undergoing mutation, recombination, natural selection, and (if called for) migration between populations. The meaning of "forward" in this context is simply that time, within the simulation, moves forward just as it does in the real world. This is in contrast to coalescent population genetic simulation in which time, as represented within the simulation, proceeds back into the past. Coalescent simulations have many advantages, but they are unwieldy if they incorporate natural selection on multiple sites.
FPG is useful for assessing the impact of natural selection on patterns of genetic variation. Since many selected sites can be segregating at once, the program is useful for understanding the Hill-Robertson effect (Hill and Robertson, 1966) under genetic hitchhiking (Maynard Smith and Haigh, 1974) or background selection (Charlesworth, Morgan and Charlesworth, 1993). It is designed to approximate real world situations with fairly large population sizes and high mutation rates over long stretches of DNA. The mutation model is an infinite sites model, meaning that no site that is segregating in the population can receive another mutation.
The basic life cycle alternates between haploid genomes (gametes) and diploid genomes (zygotes), and proceeds as follows:
Gametes are randomly paired into zygotes.
Fitness is measured for each zygote as a function of the mutations it carries. If dominance is set to 0.5, then the model is similar in most respects to a haploid selection model.
Reproductive success is the number of gametes contributed to the next generation, and is determined for every zygote by drawing from a multinomial distribution with expectations that are the relative fitness values of each fitness class.
Recombination and mutation occur during the production of gametes.
If there are multiple populations, gametes are drawn at random from each population and placed at random into other populations. Those gametes that are displaced by migrants are lost, so migration adds an extra element of genetic drift. This is an island model of migration.
New zygotes are formed by random union of gametes.
Generations are non-overlapping.
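For readers who find code easier to follow than prose, here is a deliberately simplified, self-contained sketch of this life cycle. It is not the FPG source: it tracks only a single selected site, uses additive fitness, omits recombination and migration, and approximates the multinomial reproduction step by fitness-proportional sampling of zygotes.
#include <stdio.h>
#include <stdlib.h>

/* Toy illustration of the life cycle described above (NOT the FPG source):
   a single selected site, one population, G gametes, additive fitness.     */
#define G 200        /* number of gametes (haploid genomes)                 */
#define S 0.01       /* selection coefficient                               */
#define H 0.5        /* dominance                                           */
#define U 0.001      /* per-gamete mutation probability at the site         */

int main(void)
{
    int gamete[G] = {0}, next[G];               /* 0 = wild type, 1 = mutant */
    for (int gen = 0; gen < 10 * G; gen++) {
        /* pair gametes into G/2 zygotes and measure fitness; the gamete
           pool is already randomly ordered, so adjacent pairing is random */
        double w[G / 2], wsum = 0.0;
        for (int z = 0; z < G / 2; z++) {
            int a = gamete[2 * z], b = gamete[2 * z + 1];
            w[z] = (a && b) ? 1.0 + S : (a || b) ? 1.0 + H * S : 1.0;
            wsum += w[z];
        }
        /* reproduction: each new gamete chooses a parent zygote with
           probability proportional to fitness, inherits one of its two
           gametes, and may mutate during gamete production               */
        for (int g = 0; g < G; g++) {
            double r = wsum * (rand() / (RAND_MAX + 1.0));
            int z = 0;
            while (z < G / 2 - 1 && r >= w[z]) r -= w[z++];
            next[g] = gamete[2 * z + (rand() & 1)];
            if (rand() / (RAND_MAX + 1.0) < U) next[g] = 1;
        }
        /* the new gametes form the zygotes of the next, non-overlapping
           generation                                                     */
        for (int g = 0; g < G; g++) gamete[g] = next[g];
    }
    int count = 0;
    for (int g = 0; g < G; g++) count += gamete[g];
    printf("mutant frequency after 10G generations: %.3f\n", (double)count / G);
    return 0;
}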
The current mutation model accommodates three mutation rates for mutations that are neutral, deleterious, and beneficial. The deleterious and beneficial mutations have identical selection coefficients that are used with opposite effect.
Three different fitness schemes are incorporated: additive, multiplicative, and epistatic. In each scheme, the basic parameters are a selection coefficient, s >= 0, and the dominance, 0 <= h <= 1.
Four quantities are measured for every zygote:
op = number of homozygous positions for positively selected mutations
ep = number of heterozygous positions for positively selected mutations
on = number of homozygous positions for negatively selected mutations
en = number of heterozygous positions for negatively selected mutations
Additive Fitness Scheme
W = 1 + (op - on) * s + (ep - en) * h*s
Multiplicative Fitness Scheme
selection coefficients:
ps = 1 + s
ns = 1/ps
dominance terms:
ph = (1 + h*s)/(1 + s)
nh = (1 + s)/(1 + h*s)
W = ps^op * ns^on * (ph*ps)^ep * (ns*nh)^en
For values of s near 0, this is well approximated by
W = Exp(s*((op-on) + h*(ep-en)))
Epistatic Fitness Scheme
The epistatic model requires an additional parameter, y, that is the exponent in a power function.
The development of a general epistatic function begins by considering the exponential form of the multiplicative fitness scheme. Consider, in the absence of dominance, W=Exp(s*(op-on)). Then if we include the epistatic factor, y, we get
W=Exp(s*(op-on)^y)
However, this expression will not work if the term to be raised to the y power has a negative value. This can be dealt with directly, if awkwardly, by using
W = Exp( SIGN(op-on) * s * |op-on|^y )
In other words, for every increase in (op-on) the effect on fitness is more than multiplicative if y>1 (positive or synergistic epistasis) and less than multiplicative if 0<y<1 (negative epistasis).
To include dominance we simply use the corresponding expression of the multiplicative fitness scheme with dominance.
W = Exp( SIGN((op-on) + h*(ep-en)) * s * |(op-on) + h*(ep-en)|^y )
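The three schemes can also be summarized in code. The following functions are a sketch written for this documentation (they are not taken from the FPG source, and the function names are hypothetical); they simply restate the formulas above.
#include <math.h>

/* op, on = homozygous positions for positive / negative mutations
   ep, en = heterozygous positions for positive / negative mutations
   s = selection coefficient (s >= 0), h = dominance (0 <= h <= 1),
   y = epistasis exponent (used only by the epistatic scheme).          */

double fitness_additive(int op, int on, int ep, int en, double s, double h)
{
    return 1.0 + (op - on) * s + (ep - en) * h * s;
}

double fitness_multiplicative(int op, int on, int ep, int en, double s, double h)
{
    double ps = 1.0 + s;                    /* homozygous positive factor   */
    double ns = 1.0 / ps;                   /* homozygous negative factor   */
    double ph = (1.0 + h * s) / (1.0 + s);  /* dominance adjustments, so a  */
    double nh = (1.0 + s) / (1.0 + h * s);  /* heterozygote contributes
                                               (1 + h*s) or 1/(1 + h*s)     */
    return pow(ps, op) * pow(ns, on) * pow(ph * ps, ep) * pow(ns * nh, en);
}

double fitness_epistatic(int op, int on, int ep, int en,
                         double s, double h, double y)
{
    double x = (op - on) + h * (ep - en);
    double sign = (x > 0) - (x < 0);        /* SIGN(x): -1, 0 or +1         */
    return exp(sign * s * pow(fabs(x), y));
}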
Each gamete consists of a haploid copy of the genome. The genome consists of one or more chromosomes, each consisting of a number of segments. Each segment can hold up to 32 mutations. Large genomes with hundreds to thousands of segments can be used to allow for an overall high mutation rate, corresponding perhaps to long stretches of DNA.
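As an illustration of this layout (a sketch with hypothetical names, not FPG's actual data structures), a gamete can be stored as an array of 32-bit words, one word per segment, with each segregating mutation assigned to one bit position:
#include <stdint.h>

#define NCHROMOSOMES  2
#define NSEGMENTS     100                    /* segments per chromosome     */

/* One haploid genome (gamete): a mutation at bit position p of segment s
   of chromosome c is carried if bit p of seg[c][s] is set.                 */
typedef struct {
    uint32_t seg[NCHROMOSOMES][NSEGMENTS];
} gamete_t;

/* Count the mutations carried by a gamete. */
static int mutation_count(const gamete_t *g)
{
    int n = 0;
    for (int c = 0; c < NCHROMOSOMES; c++)
        for (int s = 0; s < NSEGMENTS; s++)
            for (uint32_t bits = g->seg[c][s]; bits; bits &= bits - 1)
                n++;                         /* clear the lowest set bit    */
    return n;
}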
There are two key considerations in selecting a genome size: simulation speed and population mutation rate. The larger the genome, the slower the simulation. Also, if the mutation rate is too high for a given genome size, all available sites become segregating, no more mutations can be added to the population, and the program halts.
Recombination occurs within chromosomes according to the value of the recombination parameter. Chromosomes sort randomly into gametes, so there is, of course, high recombination between chromosomes.
Usually there will be little need for more than 2 chromosomes in the genome, as this is sufficient to compare regions that are entirely unlinked.
The fitness and polymorphism patterns within the simulated population can be quite dynamic depending on the selection parameters and fitness scheme. In particular, population mean fitnesses may fluctuate fairly dramatically over time, particularly if recombination is low. In order to get an accurate assessment of the results for any particular parameter set, it may be necessary to run the population for a long period of time and to make many measurements.
Most processes that go on within the simulation will have the population size (G - the number of gametes) as the key rate parameter. In larger populations things happen more slowly (i.e. it takes more generations for something to happen). Thus it is usually simplest to think of time in units of G generations, where G is the number of haploid genome copies in the population.
At the beginning of any simulation the population is not segregating any mutations, and it is expected to take about 2G generations on average for a neutral mutation to reach the highest frequencies. Thus the time until the first measurement can be taken from the population should generally not be less than 2G generations.
The same considerations lead to the point that the time between measurements should also be of the order of G generations, at least if it is desired that successive measurements not be strongly correlated with one another.
G, the number of gametes in the population, is an awkward parameter. Often we do not know effective population size values very well for natural populations, and so have little idea what values to use in simulations that are intended to approximate a natural situation. Another consideration is that we often expect the quantities to be measured in the simulation to vary in proportion to the population size; in that case the population size does not really matter, and it is reasonable to make the simulated population as small as possible, since small populations simulate quickly. However, size often does matter, particularly with selection models in which the selection coefficients are near zero. In these cases it will often be necessary to consider a range of population sizes to see if the effect varies nonlinearly with G.
Together these considerations often leave an investigator feeling as if they should consider all possible population sizes. This would be fine except that increasing population size increases the simulation time (i.e. makes it slower). An increase in G entails three types of speed reductions:
The program must handle more individuals each generation.
The program must be run for more generations because things happen more slowly with larger populations.
Genomes must be larger, in order for the population to accommodate the higher number of mutations that occur with larger population sizes. Larger genomes slow the program.
______________________
Downloadable Files
______________________
The following files are available:
WIN32 package (including documentation, sample data file, and WIN32 program)
PowerPC package (including documentation, sample data file, and PowerPC program)
_______________________
Running the Program
_______________________
The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. A user will usually require several runs of the program with several different parameter values; for this reason the runtime parameters are placed in the input file, and the command line takes just two parameters:
-D  Input file name, e.g. -Dmydata
-R  Output file name, e.g. -Rmyresults
For example, to run the program type:
fpg -Dmydata -Rmyresults
Alternatively, just enter the program name, and it will ask for the file names.
On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered. The user can also just hit return at this point and the program will request runtime parameters.
During the run the program will output a line of text to the screen. For a population with G genome copies, this text will be written every G generations. It contains the data set number, the generation number, the population mean fitness and the amount of room left for mutations.
The program generates a file with an extension of '.fpg' (e.g. in this example the file would be myresults.fpg).
The input file is essentially a batch file, with multiple lines, each corresponding to a command to the program. Each line of the input file will run a simulation under a set of parameters.
There are three key parameters that play a large role in whether or not a simulation is workable.
First is the size of the population, (see The Practicalities of Population Size).
Second is the genome size (the number of chromosome segments * the number of chromosomes). The genome size must be larger than the total number of mutations that will be segregating at any point in time. Unfortunately, this number is difficult to know under models with selection. The program checks at the start whether there is sufficient room given the neutral mutation rate, and halts the simulation if there is not. Note that this check may be misleading, as many selection models reduce the need for space for neutral mutations. The program also reports when there is insufficient room during the run; if this occurs multiple times, the simulation is halted.
The third key parameter is the mutation rate (particularly for neutral or nearly neutral mutations). The mutation rates for neutral and nearly neutral mutations must not be so high that all available mutation sites are segregating in the population.
The downloadable version of the program calls for 5 or 6 megabytes of memory while it is running. This memory model can accommodate a population of haploid size 1000, each genome with a length of 1000 segments. Larger or smaller memory models can be run, but require that parameters in the source code be changed and the program recompiled.
Each line of the input file consists of a string of runtime parameters, most of which are required. The input file can have as many lines as desired. All results are saved to the output file, the name of which was given when the program was started. After the run for one parameter set is complete, the output file is saved, so that a cancellation of the program will not destroy results from completed simulations.
The parameter flags, and their meanings, are as follows (upper or lower case letters may be used):
'R' invokes the correlation coefficient
'S' invokes r^2, the square of the correlation coefficient
'D' invokes the basic gametic disequilibrium measure D
'P' invokes D' (D prime), which is equal to D/Dmax, where Dmax is the maximum possible value of D given the observed allele frequencies.
'B' invokes the absolute value of D, |D|
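For reference, the five measures are related as follows. The sketch below is hypothetical code written for this documentation (not FPG's source); it computes each measure from the haplotype frequency pAB and the allele frequencies pA and pB at a pair of biallelic sites, assuming both sites are polymorphic (0 < pA, pB < 1).
#include <math.h>

/* Basic gametic disequilibrium and its absolute value */
double ld_D(double pAB, double pA, double pB)    { return pAB - pA * pB; }
double ld_absD(double pAB, double pA, double pB) { return fabs(ld_D(pAB, pA, pB)); }

/* Correlation coefficient r and its square r^2 */
double ld_r(double pAB, double pA, double pB)
{
    return ld_D(pAB, pA, pB) / sqrt(pA * (1 - pA) * pB * (1 - pB));
}
double ld_r2(double pAB, double pA, double pB)
{
    double r = ld_r(pAB, pA, pB);
    return r * r;
}

/* D' = D / Dmax, where Dmax depends on the sign of D */
double ld_Dprime(double pAB, double pA, double pB)
{
    double D = ld_D(pAB, pA, pB);
    double Dmax = (D < 0)
        ? fmin(pA * pB, (1 - pA) * (1 - pB))
        : fmin(pA * (1 - pB), (1 - pA) * pB);
    return D / Dmax;
}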
FPG has a large number of parameters and options. Thus even for a very simple question, such as how much variation is present in the absence of natural selection, recombination and migration, there are still many decisions to be made. It is quite easy to input a set of parameters that are not workable for one reason or another, or that generate an output that you cannot make much sense of. In order for results to be informative, it is usually best to: (1) begin with a simple model; (2) familiarize yourself with the results; and (3) build up from there. For example, here is a good starting point:
-g100 -c1 -x2 -u2 -v0 -r0 -s0 -n100 -i4 -j1 -oN -k1
This runs with a small genome (e.g. 2 chromosomes, each of 1 segment, -x2 -c1), low mutation rates (-u2 -v0), no selection (-s0), no recombination (-r0), a small to moderate population size (e.g. -g100), and makes enough measurements (-n100) that the results should not be subject to too much variance.
The population model implemented in FPG is somewhat arbitrary in a number of ways. In particular, there are a number of other reasonable ways to implement selection and migration. One basic constraint of having populations with hundreds to thousands of genomes is that many classic population genetic results that hold best when populations are very large do not hold so well when populations are small. This makes it hard to check for bugs in the code.
_______________________
Program Output
_______________________
At the top of the output file is basic information regarding file names, and parameter definitions. Thereafter the results for each parameter set are presented in order.
First is printed the parameter string.
A table listing the mean population fitness values recorded at intervals during the simulation.
A comment on the type of fitness model.
A comment on the generation number at which measurements began.
A listing of runtime parameter values.
A listing of some basic measurements regarding the amount of available memory used by mutations during the simulation.
A listing of the total numbers of mutations entered, lost, fixed, and finally polymorphic. Also shown is the fixation rate for each class of mutation since the time that measurements began. Note that the fixation rate for neutral mutations should be close to the population neutral mutation rate as given with the '-U' term.
A table of mean heterozygosities, Watterson's estimate of the population mutation rate (Watterson, 1975), and Tajima's D (Tajima, 1989) for a sample, all for each class of polymorphism.
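As a reminder of what is being reported (a sketch written for this documentation, not FPG's code): Watterson's estimator is the number of segregating sites divided by a harmonic sum over the sample size, and Tajima's D contrasts the mean pairwise heterozygosity with Watterson's estimate, scaled by an estimate of the standard deviation of their difference. Negative values of Tajima's D indicate an excess of rare variants relative to the neutral expectation, and positive values an excess of intermediate-frequency variants.
/* Watterson's (1975) estimator of the population mutation rate, from S
   segregating sites in a sample of n sequences.                         */
double theta_watterson(int S, int n)
{
    double a = 0.0;                      /* a = 1 + 1/2 + ... + 1/(n-1)   */
    for (int i = 1; i < n; i++)
        a += 1.0 / i;
    return S / a;
}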
If there are multiple populations and migration, then a series of tables are generated:
The mean, among measurements, of the variance in fitness among populations.
The mean variance in heterozygosity of different polymorphism classes among populations.
The average number of shared polymorphisms, exclusive polymorphisms, and fixed differences between pairs of populations.
Mean heterozygosities within and between populations, mean Fst estimates and mean Nm estimates, according to Hudson, Slatkin and Maddison (1992). When the migration rate is on the order of 1, there is a general expectation that Nm should be close to the migration rate between subpopulations.
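A rough sketch of a Hudson, Slatkin and Maddison (1992) style calculation follows. This is hypothetical code written for this documentation, and the island-model constant used to convert Fst to Nm is an assumption; FPG's exact convention may differ.
/* Fst from mean pairwise differences within (Hw) and between (Hb)
   subpopulations, in the style of Hudson, Slatkin and Maddison (1992).
   Assumes Hb > 0.                                                       */
double fst_estimate(double Hw, double Hb)
{
    return 1.0 - Hw / Hb;
}

/* Nm solved from an island-model relationship.  The constant 2 here
   assumes a haploid island model (Fst ~ 1/(1 + 2Nm)); this constant is
   an assumption, not necessarily the one used by FPG.  Assumes
   0 < fst < 1.                                                          */
double nm_estimate(double fst)
{
    return (1.0 / fst - 1.0) / 2.0;
}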
Tables (one for each class of polymorphism) of mean fitness and heterozygosity values for different fitness classes. At each measurement, all of the zygotes in the population are ranked by their fitness and sorted into evenly sized bins. The number of bins is determined by the population size, such that there are no fewer than 5 zygotes per bin and no more than 50 bins (a sketch of this binning rule appears after the list below). Mean values of individual bins are determined from all of the measurements made over the course of the simulation. Each table lists:
The average location in the fitness distribution
The mean fitness within each bin
The mean fitness within a bin as a percentage of the mean of the population.
The mean observed heterozygosity within each bin.
The mean observed heterozygosity within each bin as a percentage of the overall heterozygosity for the population.
The mean value of Tajima's D for a sample from each bin for that polymorphism class.
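The binning rule mentioned above can be written compactly. This is a sketch for illustration only (hypothetical name; FPG's exact rounding rule may differ): use as many bins as possible while keeping at least 5 zygotes per bin, up to a maximum of 50 bins.
/* Number of fitness bins for a population of nzygotes zygotes:
   no fewer than 5 zygotes per bin and no more than 50 bins.        */
int n_fitness_bins(int nzygotes)
{
    int nbins = nzygotes / 5;          /* at least 5 zygotes per bin */
    if (nbins > 50) nbins = 50;        /* never more than 50 bins    */
    if (nbins < 1)  nbins = 1;
    return nbins;
}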
If Linkage Disequilibrium Analyses were invoked then the following information is provided for each type of polymorphism class and each LD measure:
The overall mean LD between all pairs of polymorphic sites in a sample for that LD measure and polymorphism class.
The mean LD between pairs of sites separated by particular distances along the chromosome, and on different chromosomes. Distances are counted in 'regions'. Each chromosome has 5 regions, and each region corresponds to a given number of bits, determined by the number of chromosome segments as defined by the '-C' parameter.
A table listing the different fitness classes, and the value of LD over different distances for that fitness class.
If the '-T' option was invoked, a pseudo DNA sequence data set, suitable for analysis by the SITES program, is generated. This data set is drawn from the population at the time of the last measurement made in the simulation. If analyses were done for each of the polymorphism classes, then the different types of polymorphisms (Beneficial, Harmful, and Neutral) receive different base values in the data set. The data set includes one sequence drawn randomly from each fitness bin.
_______________________
Program Limitations
_______________________
The distributed version of the program is set to handle populations of up to G=1000 (i.e. 1000 gametes, 500 zygotes) and a total genome length of up to 1000 segments. Since each segment has a length of 32 bits, this means that the population can hold up to 32,000 mutations. With these lengths it is possible to simulate situations that model real world genome sizes. For example, if each of the 32,000 mutable sites is taken to correspond to 100 base pairs, the model covers a genome of 3.2 megabases. This is quite workable, so long as the mutation rate you use does not overload the genome (see Key Parameters).
For population sizes of 100 gametes and genome sizes of 10 segments, the simulations are rapid, and it takes most computers only a few minutes to do 50G generations (i.e. 5000 generations). Increasing the population size has a drastic effect on speed (see The Practicalities of Population Size).
If a very large number of simulations are desired for a particular population size and genome size, then it may be possible to increase speed by recompiling the program with reset values of key constants in the source code.
_______________________
Literature Cited
_______________________
Charlesworth, B., M. T. Morgan and D. Charlesworth, 1993 The effect of deleterious mutations on neutral molecular evolution. Genetics 134: 1289-1303.
Hill, W. G., and A. Robertson, 1966 The effect of linkage on limits to artificial selection. Genet. Res. 8: 269-294.
Hudson, R. R., M. Slatkin and W. P. Maddison, 1992 Estimation of levels of gene flow from DNA sequence data. Genetics 132: 583-589.
Maynard Smith, J., and J. Haigh, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23-35.
Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.
Watterson, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-275.
This page was last changed March 31, 2014