FPG
-- A COMPUTER PROGRAM FOR FORWARD POPULATION GENETIC SIMULATION --
DOCUMENTATION
Jody Hey
Department of Genetics
Rutgers University
Nelson Biological Labs
604 Allison Rd.
Piscataway, NJ 08854-8082
732-445-5272
fax 732-445-5870
https://ccgg.temple.edu/heylab/home
* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.
_______________________
FPG (for Forward Population Genetic simulation) simulates a population of constant size that is undergoing mutation, recombination, natural selection, and (if called for) migration between populations. The meaning of "forward" in this context is simply that time, within the simulation, moves forward just as it does in the real world. This is in contrast to coalescent population genetic simulation in which time, as represented within the simulation, proceeds back into the past. Coalescent simulations have many advantages, but they are unwieldy if they incorporate natural selection on multiple sites.
FPG is useful for assessing the impact of natural selection on patterns of genetic variation. Since many selected sites can be segregating at once, the program is useful for understanding the Hill-Robertson effect (Hill and Robertson, 1966) under genetic hitchhiking (Maynard Smith and Haigh, 1974) or background selection (Charlesworth, Morgan and Charlesworth, 1993). It is designed to approximate real world situations with fairly large population sizes and high mutation rates over long stretches of DNA. The mutation model is an infinite sites model, meaning that no site that is segregating in the population can receive another mutation.
The basic life cycle alternates between haploid genomes (gametes) and diploid genomes (zygotes), and proceeds as follows:
Gametes are randomly paired into zygotes.
Fitness is measured for each zygote as a function of the mutations it carries. If dominance is set to 0.5, then the model is similar in most respects to a haploid selection model.
Reproductive success is the number of gametes contributed to the next generation, and is determined for every zygote by drawing from a multinomial distribution with expectations that are the relative fitness values of each fitness class.
Recombination and mutation occur during the production of gametes.
If there are multiple populations, gametes are drawn at random from each population and placed at random into other populations. Those gametes that are displaced by migrants are lost, so migration adds an extra element of genetic drift. This is an island model of migration.
New zygotes are formed by random union of gametes.
Generations are non-overlapping.
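For readers who find code easier to follow than prose, here is a deliberately simplified, self-contained sketch of this life cycle. It is not the FPG source: it tracks only a single selected site, uses additive fitness, omits recombination and migration, and approximates the multinomial reproduction step by fitness-proportional sampling of zygotes.
#include <stdio.h>
#include <stdlib.h>

/* Toy illustration of the life cycle described above (NOT the FPG source):
   a single selected site, one population, G gametes, additive fitness.     */
#define G 200        /* number of gametes (haploid genomes)                 */
#define S 0.01       /* selection coefficient                               */
#define H 0.5        /* dominance                                           */
#define U 0.001      /* per-gamete mutation probability at the site         */

int main(void)
{
    int gamete[G] = {0}, next[G];               /* 0 = wild type, 1 = mutant */
    for (int gen = 0; gen < 10 * G; gen++) {
        /* pair gametes into G/2 zygotes and measure fitness; the gamete
           pool is already randomly ordered, so adjacent pairing is random */
        double w[G / 2], wsum = 0.0;
        for (int z = 0; z < G / 2; z++) {
            int a = gamete[2 * z], b = gamete[2 * z + 1];
            w[z] = (a && b) ? 1.0 + S : (a || b) ? 1.0 + H * S : 1.0;
            wsum += w[z];
        }
        /* reproduction: each new gamete chooses a parent zygote with
           probability proportional to fitness, inherits one of its two
           gametes, and may mutate during gamete production               */
        for (int g = 0; g < G; g++) {
            double r = wsum * (rand() / (RAND_MAX + 1.0));
            int z = 0;
            while (z < G / 2 - 1 && r >= w[z]) r -= w[z++];
            next[g] = gamete[2 * z + (rand() & 1)];
            if (rand() / (RAND_MAX + 1.0) < U) next[g] = 1;
        }
        /* the new gametes form the zygotes of the next, non-overlapping
           generation                                                     */
        for (int g = 0; g < G; g++) gamete[g] = next[g];
    }
    int count = 0;
    for (int g = 0; g < G; g++) count += gamete[g];
    printf("mutant frequency after 10G generations: %.3f\n", (double)count / G);
    return 0;
}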
The current mutation model accommodates three mutation rates for mutations that are neutral, deleterious, and beneficial. The deleterious and beneficial mutations have identical selection coefficients that are used with opposite effect.
Three different fitness schemes are incorporated: additive, multiplicative, and epistatic. In each scheme, the basic parameters are a selection coefficient, s >= 0, and the dominance, 0 <= h <= 1.
Four quantities are measured for every zygote:
op = number of homozygous positions for positively selected mutations
ep = number of heterozygous positions for positively selected mutations
on = number of homozygous positions for negatively selected mutations
en = number of heterozygous positions for negatively selected mutations
Additive Fitness Scheme
W = 1 + (op - on) * s + (ep - en) * h*s
Multiplicative Fitness Scheme
selection coefficients:
ps = 1 + s
ns = 1/ps
dominance terms:
ph = (1 + h*s)/(1 + s)
nh = (1 + s)/(1 + h*s)
W = ps^op * ns^on * (ph*ps)^ep * (ns*nh)^en
For values of s near 0, this is well approximated by
W = Exp(s*((op-on) + h*(ep-en)))
Epistatic Fitness Scheme
The epistatic model requires an additional parameter, y, that is the exponent in a power function.
The development of a general epistatic function begins by considering the exponential form of the multiplicative fitness scheme. Consider, in the absence of dominance, W=Exp(s*(op-on)). Then if we include the epistatic factor, y, we get
W=Exp(s*(op-on)^y)
However, this expression will not work if the term to be raised to the y power has a negative value. This can be dealt with directly, if awkwardly, by using
W = Exp( SIGN(op-on) * s * |op-on|^y )
In other words, for every increase in (op-on) the effect on fitness is more than multiplicative if y>1 (positive or synergistic epistasis) and less than multiplicative if 0<y<1 (negative epistasis).
To include dominance we simply use the corresponding expression of the multiplicative fitness scheme with dominance.
W = Exp( SIGN((op-on) + h*(ep-en)) * s * |(op-on) + h*(ep-en)|^y )
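The three schemes can also be summarized in code. The following functions are a sketch written for this documentation (they are not taken from the FPG source, and the function names are hypothetical); they simply restate the formulas above.
#include <math.h>

/* op, on = homozygous positions for positive / negative mutations
   ep, en = heterozygous positions for positive / negative mutations
   s = selection coefficient (s >= 0), h = dominance (0 <= h <= 1),
   y = epistasis exponent (used only by the epistatic scheme).          */

double fitness_additive(int op, int on, int ep, int en, double s, double h)
{
    return 1.0 + (op - on) * s + (ep - en) * h * s;
}

double fitness_multiplicative(int op, int on, int ep, int en, double s, double h)
{
    double ps = 1.0 + s;                    /* homozygous positive factor   */
    double ns = 1.0 / ps;                   /* homozygous negative factor   */
    double ph = (1.0 + h * s) / (1.0 + s);  /* dominance adjustments, so a  */
    double nh = (1.0 + s) / (1.0 + h * s);  /* heterozygote contributes
                                               (1 + h*s) or 1/(1 + h*s)     */
    return pow(ps, op) * pow(ns, on) * pow(ph * ps, ep) * pow(ns * nh, en);
}

double fitness_epistatic(int op, int on, int ep, int en,
                         double s, double h, double y)
{
    double x = (op - on) + h * (ep - en);
    double sign = (x > 0) - (x < 0);        /* SIGN(x): -1, 0 or +1         */
    return exp(sign * s * pow(fabs(x), y));
}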
Each gamete consists of a haploid copy of the genome. The genome consists of one or more chromosomes, each consisting of a number of segments. Each segment can hold up to 32 mutations. Large genomes with hundreds to thousands of segments can be used to allow for an overall high mutation rate, corresponding perhaps to long stretches of DNA.
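As an illustration of this layout (a sketch with hypothetical names, not FPG's actual data structures), a gamete can be stored as an array of 32-bit words, one word per segment, with each segregating mutation assigned to one bit position:
#include <stdint.h>

#define NCHROMOSOMES  2
#define NSEGMENTS     100                    /* segments per chromosome     */

/* One haploid genome (gamete): a mutation at bit position p of segment s
   of chromosome c is carried if bit p of seg[c][s] is set.                 */
typedef struct {
    uint32_t seg[NCHROMOSOMES][NSEGMENTS];
} gamete_t;

/* Count the mutations carried by a gamete. */
static int mutation_count(const gamete_t *g)
{
    int n = 0;
    for (int c = 0; c < NCHROMOSOMES; c++)
        for (int s = 0; s < NSEGMENTS; s++)
            for (uint32_t bits = g->seg[c][s]; bits; bits &= bits - 1)
                n++;                         /* clear the lowest set bit    */
    return n;
}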
There are two key considerations in selecting a genome size: simulation speed and population mutation rate. The larger the genome, the slower the simulation. Also, if the mutation rate is too high for a given genome size, all available sites become segregating, no more mutations can be added to the population, and the program halts.
Recombination occurs within chromosomes according to the value of the recombination parameter. Chromosomes sort randomly into gametes, so there is, of course, high recombination between chromosomes.
Usually there will be little need for more than 2 chromosomes in the genome, as this is sufficient to compare regions that are entirely unlinked.
The fitness and polymorphism patterns within the simulated population can be quite dynamic depending on the selection parameters and fitness scheme. In particular, population mean fitnesses may fluctuate fairly dramatically over time, particularly if recombination is low. In order to get an accurate assessment of the results for any particular parameter set, it may be necessary to run the population for a long period of time and to make many measurements.
Most processes that go on within the simulation will have the population size (G - the number of gametes) as the key rate parameter. In larger populations things happen more slowly (i.e. it takes more generations for something to happen). Thus it is usually simplest to think of time in units of G generations, where G is the number of haploid genome copies in the population.
At the beginning of any simulation the population is not segregating any mutations, and it is expected to take about 2G generations on average for a neutral mutation to reach the highest frequencies. Thus the time until the first measurement can be taken from the population should generally not be less than 2G generations.
The same considerations lead to the point that the time between measurements should also be of the order of G generations, at least if it is desired that successive measurements not be strongly correlated with one another.
G, the number of gametes in the population, is an awkward parameter. Often we do not know effective population size values very well for natural populations, and so have little idea what values to use in simulations that are intended to approximate a natural situation. Another consideration is that we often expect the quantities to be measured in the simulation to vary in proportion to the population size; in that case the population size does not really matter, and it is reasonable to make the simulated population as small as possible, since small populations simulate quickly. However, size often does matter, particularly with selection models in which the selection coefficients are near zero. In these cases it will often be necessary to consider a range of population sizes to see if the effect varies nonlinearly with G.
Together these considerations often leave an investigator feeling as if they should consider all possible population sizes. This would be fine except that increasing population size increases the simulation time (i.e. makes it slower). An increase in G entails three types of speed reductions:
The program must handle more individuals each generation.
The program must be run for more generations because things happen more slowly with larger populations.
Genomes must be larger, in order for the population to accommodate the higher number of mutations that occur with larger population sizes. Larger genomes slow the program.
______________________
Downloadable Files
______________________
The following files are available:
WIN32 package (including documentation, sample data file, and WIN32 program)
PowerPC package (including documentation, sample data file, and PowerPC program)
_______________________
Running the Program
_______________________
The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. A user will usually require several runs of the program with several different parameter values; for this reason the runtime parameters are placed in the input file, and the command line takes just two parameters:
-D  Input file name, e.g. -Dmydata
-R  Output file name, e.g. -Rmyresults
For example, to run the program type:
fpg -Dmydata -Rmyresults
Alternatively, just enter the program name, and it will ask for the file names.
On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered. The user can also just hit return at this point and the program will request runtime parameters.
During the run the program will output a line of text to the screen. For a population with G genome copies, this text will be written every G generations. It contains the data set number, the generation number, the population mean fitness and the amount of room left for mutations.
The program generates a file with an extension of '.fpg' (e.g. in this example the file would be myresults.fpg).
The input file is essentially a batch file, with multiple lines, each corresponding to a command to the program. Each line of the input file will run a simulation under a set of parameters.
There are three key parameters that play a large role in whether or not a simulation is workable.
First is the size of the population, (see The Practicalities of Population Size).
Second is the genome size (the number of chromosome segments * the number of chromosomes). The genome size must be larger than the total number of mutations that will be segregating at any point in time. Unfortunately, this number is difficult to know under models with selection. The program checks at the start whether there is sufficient room given the neutral mutation rate, and halts the simulation if there is not. Note that this check may be misleading, as many selection models reduce the need for space for neutral mutations. The program also reports when there is insufficient room during the run; if this occurs multiple times, the simulation is halted.
The third key parameter is the mutation rate (particularly for neutral or nearly neutral mutations). The mutation rates for neutral and nearly neutral mutations must not be so high that all available mutation sites are segregating in the population.
The downloadable version of the program calls for 5 or 6 megabytes of memory while it is running. This memory model can accommodate a population of haploid size 1000, each genome with a length of 1000 segments. Larger or smaller memory models can be run, but require that parameters in the source code be changed and the program recompiled.
Each line of the input file consists of a string of runtime parameters, most of which are required. The input file can have as many lines as desired. All results are saved to the output file, the name of which was given when the program was started. After the run for one parameter set is complete, the output file is saved, so that a cancellation of the program will not destroy results from completed simulations.
The parameter flags, and their meanings, are as follows (upper or lower case letters may be used):
'R' invokes the correlation coefficient
'S' invokes r^2, the square of the correlation coefficient
'D' invokes the basic gametic disequilibrium measure D
'P' invokes D' (D prime), which is equal to D/Dmax, where Dmax is the maximum possible value of D given the observed allele frequencies.
'B' invokes the absolute value of D, |D|
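For reference, the five measures are related as follows. The sketch below is hypothetical code written for this documentation (not FPG's source); it computes each measure from the haplotype frequency pAB and the allele frequencies pA and pB at a pair of biallelic sites, assuming both sites are polymorphic (0 < pA, pB < 1).
#include <math.h>

/* Basic gametic disequilibrium and its absolute value */
double ld_D(double pAB, double pA, double pB)    { return pAB - pA * pB; }
double ld_absD(double pAB, double pA, double pB) { return fabs(ld_D(pAB, pA, pB)); }

/* Correlation coefficient r and its square r^2 */
double ld_r(double pAB, double pA, double pB)
{
    return ld_D(pAB, pA, pB) / sqrt(pA * (1 - pA) * pB * (1 - pB));
}
double ld_r2(double pAB, double pA, double pB)
{
    double r = ld_r(pAB, pA, pB);
    return r * r;
}

/* D' = D / Dmax, where Dmax depends on the sign of D */
double ld_Dprime(double pAB, double pA, double pB)
{
    double D = ld_D(pAB, pA, pB);
    double Dmax = (D < 0)
        ? fmin(pA * pB, (1 - pA) * (1 - pB))
        : fmin(pA * (1 - pB), (1 - pA) * pB);
    return D / Dmax;
}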
FPG has a large number of parameters and options. Thus even for a very simple question, such as how much variation is present in the absence of natural selection, recombination and migration, there are still many decisions to be made. It is quite easy to input a set of parameters that are not workable for one reason or another, or that generate an output that you cannot make much sense of. In order for results to be informative, it is usually best to: (1) begin with a simple model; (2) familiarize yourself with the results; and (3) build up from there. For example, here is a good starting point:
-g100 -c1 -x2 -u2 -v0 -r0 -s0 -n100 -i4 -j1 -oN -k1
This runs with a small genome (e.g. 2 chromosomes, each of 1 segment, -x2 -c1), low mutation rates (-u2 -v0), no selection (-s0), no recombination (-r0), a small to moderate population size (e.g. -g100), and makes enough measurements (-n100) that the results should not be subject to too much variance.
The population model implemented in FPG is somewhat arbitrary in a number of ways. In particular, there are a number of other reasonable ways to implement selection and migration. One basic constraint of having populations with hundreds to thousands of genomes is that many classic population genetic results that hold best when populations are very large do not hold so well when populations are small. This makes it hard to check for bugs in the code.
_______________________
Program Output
_______________________
At the top of the output file is basic information regarding file names, and parameter definitions. Thereafter the results for each parameter set are presented in order.
First is printed the parameter string.
A table listing the mean population fitness values recorded at intervals during the simulation.
A comment on the type of fitness model.
A comment on the generation number at which measurements began.
A listing of runtime parameter values.
A listing of some basic measurements regarding the amount of available memory used by mutations during the simulation.
A listing of the total numbers of mutations entered, lost, fixed, and finally polymorphic. Also shown is the fixation rate for each class of mutation since the time that measurements began. Note that the fixation rate for neutral mutations should be close to the population neutral mutation rate as given with the '-U' term.
A table of mean heterozygosities, Watterson's estimate of the population mutation rate (Watterson, 1975), and Tajima's D (Tajima, 1989) for a sample, all for each class of polymorphism.
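As a reminder of what is being reported (a sketch written for this documentation, not FPG's code): Watterson's estimator is the number of segregating sites divided by a harmonic sum over the sample size, and Tajima's D contrasts the mean pairwise heterozygosity with Watterson's estimate, scaled by an estimate of the standard deviation of their difference. Negative values of Tajima's D indicate an excess of rare variants relative to the neutral expectation, and positive values an excess of intermediate-frequency variants.
/* Watterson's (1975) estimator of the population mutation rate, from S
   segregating sites in a sample of n sequences.                         */
double theta_watterson(int S, int n)
{
    double a = 0.0;                      /* a = 1 + 1/2 + ... + 1/(n-1)   */
    for (int i = 1; i < n; i++)
        a += 1.0 / i;
    return S / a;
}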
If there are multiple populations and migration, then a series of tables are generated:
The mean, among measurements, of the variance in fitness among populations.
The mean variance in heterozygosity of different polymorphism classes among populations.
The average number of shared polymorphisms, exclusive polymorphisms, and fixed differences between pairs of populations.
Mean heterozygosities within and between populations, mean Fst estimates and mean Nm estimates, according to Hudson, Slatkin and Maddison (1992). When the migration rate is on the order of 1, there is a general expectation that Nm should be close to the migration rate between subpopulations.
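A rough sketch of a Hudson, Slatkin and Maddison (1992) style calculation follows. This is hypothetical code written for this documentation, and the island-model constant used to convert Fst to Nm is an assumption; FPG's exact convention may differ.
/* Fst from mean pairwise differences within (Hw) and between (Hb)
   subpopulations, in the style of Hudson, Slatkin and Maddison (1992).
   Assumes Hb > 0.                                                       */
double fst_estimate(double Hw, double Hb)
{
    return 1.0 - Hw / Hb;
}

/* Nm solved from an island-model relationship.  The constant 2 here
   assumes a haploid island model (Fst ~ 1/(1 + 2Nm)); this constant is
   an assumption, not necessarily the one used by FPG.  Assumes
   0 < fst < 1.                                                          */
double nm_estimate(double fst)
{
    return (1.0 / fst - 1.0) / 2.0;
}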
Tables (one for each class of polymorphism) of mean fitness and heterozygosity values for different fitness classes. At each measurement, all of the zygotes in the population are ranked by their fitness and sorted into evenly sized bins. The number of bins is determined by the population size, such that there are no fewer than 5 zygotes per bin and no more than 50 bins (a sketch of this binning rule appears after the list below). Mean values of individual bins are determined from all of the measurements made over the course of the simulation. Each table lists:
The average location in the fitness distribution
The mean fitness within each bin
The mean fitness within a bin as a percentage of the mean of the population.
The mean observed heterozygosity within each bin.
The mean observed heterozygosity within each bin as a percentage of the overall heterozygosity for the population.
The mean value of Tajima's D for a sample from each bin for that polymorphism class.
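The binning rule mentioned above can be written compactly. This is a sketch for illustration only (hypothetical name; FPG's exact rounding rule may differ): use as many bins as possible while keeping at least 5 zygotes per bin, up to a maximum of 50 bins.
/* Number of fitness bins for a population of nzygotes zygotes:
   no fewer than 5 zygotes per bin and no more than 50 bins.        */
int n_fitness_bins(int nzygotes)
{
    int nbins = nzygotes / 5;          /* at least 5 zygotes per bin */
    if (nbins > 50) nbins = 50;        /* never more than 50 bins    */
    if (nbins < 1)  nbins = 1;
    return nbins;
}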
If Linkage Disequilibrium Analyses were invoked then the following information is provided for each type of polymorphism class and each LD measure:
The overall mean LD between all pairs of polymorphic sites in a sample for that LD measure and polymorphism class.
The mean LD between pairs of sites separated by particular distances along the chromosome, and on different chromosomes. Distances are counted in 'regions'. Each chromosome has 5 regions, and each region corresponds to a given number of bits, determined by the number of chromosome segments as defined by the '-C' parameter.
A table listing the different fitness classes, and the value of LD over different distances for that fitness class.
If the '-T' option was invoked, a pseudo DNA sequence data set, suitable for analysis by the SITES program, is generated. This data set is drawn from the population at the time of the last measurement made in the simulation. If analyses were done for each of the polymorphism classes, then the different types of polymorphisms (Beneficial, Harmful, and Neutral) receive different base values in the data set. The data set includes one sequence drawn randomly from each fitness bin.
_______________________
Program Limitations
_______________________
The distributed version of the program is set to handle populations of up to G=1000 (i.e. 1000 gametes, 500 zygotes) and a total genome length of up to 1000 segments. Since each segment has a length of 32 bits, this means that the population can hold up to 32,000 mutations. With these lengths it is possible to simulate situations that model real world genome sizes. For example, if each of the 32,000 mutable sites is taken to correspond to 100 base pairs, the model covers a genome of 3.2 megabases. This is quite workable, so long as the mutation rate you use does not overload the genome (see Key Parameters).
For population sizes of 100 gametes and genome sizes of 10 segments, the simulations are rapid, and it takes most computers only a few minutes to do 50G generations (i.e. 5000 generations). Increasing the population size has a drastic effect on speed (see The Practicalities of Population Size).
If a very large number of simulations are desired for a particular population size and genome size, then it may be possible to increase speed by recompiling the program with reset values of key constants in the source code.
_______________________
Literature Cited
_______________________
Charlesworth, B., M. T. Morgan and D. Charlesworth, 1993 The effect of deleterious mutations on neutral molecular evolution. Genetics 134: 1289-1303.
Hill, W. G., and A. Robertson, 1966 The effect of linkage on limits to artificial selection. Genet. Res. 8: 269-294.
Hudson, R. R., M. Slatkin and W. P. Maddison, 1992 Estimation of levels of gene flow from DNA sequence data. Genetics 132: 583-589.
Maynard Smith, J., and J. Haigh, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23-35.
Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.
Watterson, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-275.
This page was last changed March 31, 2014