GeneArtisan Help
GeneArtisan is a computer program for simulating population samples of
disease and normal chromosomes bearing multiple linked genetic markers (SNPs or STRPs)
under a model of mutation, genetic drift, selection and recombination.
The simulation method allows samples to be generated under a coalescent
process with a case-control sampling strategy and polymorphic marker
ascertainment. The method incorporates several features that make the
simulation more realistic, including selection acting on
disease alleles; sample ascertainment of disease chromosomes and polymorphic markers;
a genetic dominance model of disease expression that allows incomplete penetrance and phenocopies;
and an accurate genetic map of recombination rates and hotspots for recombination in the human genome
(or alternatively, an improved method for simulating the distributions of hotspots).
Population data
Number of disease individuals: the number of
disease individuals in the sample. If it is set to 0, an ordinary simulation is
carried out without the case-control sampling strategy.
Number of normal individuals: the number of
normal individuals in the sample. It must be greater than 0.
Population demography
Population size: the diploid population size in the current generation.
Growth rate: the population size is assumed to have grown
exponentially at the given rate. Set the growth rate to 0 for a constant-size population.
Disease mutation age: the time (in generations) in the past at which the disease
mutation first arose. Ensure that the population size is not too small at this time, given
the present-day population size and the growth rate. The population size at the time
the disease mutation first arose must be at least 2.
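The check on the population size at the time the disease mutation arose can be sketched as follows. This is a minimal illustration that assumes continuous exponential growth, N(t) = N0 * exp(-r * t); the exact growth formula used by GeneArtisan is not stated in this help text, and the function names and input values are hypothetical.

```python
import math

MIN_FOUNDING_SIZE = 2  # minimum population size allowed when the mutation arose


def founding_population_size(current_size, growth_rate, mutation_age):
    """Population size `mutation_age` generations ago.

    Assumes continuous exponential growth (an assumption, not GeneArtisan's
    documented formula): N(t) = N0 * exp(-r * t).
    """
    return current_size * math.exp(-growth_rate * mutation_age)


def mutation_age_is_valid(current_size, growth_rate, mutation_age):
    """True if the population was large enough when the mutation arose."""
    size = founding_population_size(current_size, growth_rate, mutation_age)
    return size >= MIN_FOUNDING_SIZE


if __name__ == "__main__":
    # Hypothetical inputs: 10,000 diploid individuals today, growth rate 0.005
    for age in (500, 1000, 2000):
        size = founding_population_size(10_000, 0.005, age)
        print(age, round(size, 3), mutation_age_is_valid(10_000, 0.005, age))
```

With a growth rate of 0 the founding size equals the present-day size, so any mutation age passes the check.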
Selection coefficient: the selection coefficient of the disease
allele.
Current disease frequency button: reports the
expected population frequency of the disease allele in the current generation, and its variance.
These values are obtained by simulating the sample path (over time) of the population frequency of the disease
allele for 1000 iterations, conditional on non-extinction and non-fixation. The initial frequency of
the disease allele when it first arose is set to 1 over the population size at that time.
The disease mutation age can then be adjusted to obtain a desired population frequency
of the disease allele in the present-day generation. If the disease allele is fixed or lost
in more than 0.999 of the first 1000 simulated sample paths, the program terminates.
The above 4 parameters must then be modified before the simulation program can
continue.
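The procedure behind this button can be illustrated with a small forward simulation. The sketch below is not GeneArtisan's implementation: it assumes a Wright-Fisher model with genic selection (carriers have relative fitness 1 + s), continuous exponential growth, and 1000 attempted paths of which only the non-lost, non-fixed ones are kept. All function names are hypothetical.

```python
import math
import random
import statistics


def simulate_frequency_path(current_size, growth_rate, age, s, rng):
    """Follow the disease allele forward in time for `age` generations.

    Returns the present-day frequency, or None if the allele was lost or fixed.
    """
    def size_at(generations_ago):
        # Assumed continuous exponential growth, floored at 2 individuals
        return max(2, round(current_size * math.exp(-growth_rate * generations_ago)))

    freq = 1.0 / size_at(age)  # initial frequency: 1 over the population size
    for generations_ago in range(age - 1, -1, -1):
        n_chromosomes = 2 * size_at(generations_ago)
        # Assumed genic selection: carriers have relative fitness 1 + s
        expected = freq * (1 + s) / (1 + s * freq)
        # Genetic drift: binomial sampling of chromosomes (sum of Bernoulli draws)
        copies = sum(1 for _ in range(n_chromosomes) if rng.random() < expected)
        freq = copies / n_chromosomes
        if copies == 0 or copies == n_chromosomes:
            return None
    return freq


def current_disease_frequency(current_size, growth_rate, age, s,
                              iterations=1000, seed=None):
    """Mean and variance of the present-day frequency over retained paths."""
    rng = random.Random(seed)
    paths = [simulate_frequency_path(current_size, growth_rate, age, s, rng)
             for _ in range(iterations)]
    kept = [f for f in paths if f is not None]
    if (iterations - len(kept)) / iterations > 0.999:
        raise RuntimeError("disease allele lost or fixed in almost every path; "
                           "adjust the demographic parameters")
    return statistics.mean(kept), statistics.pvariance(kept)


if __name__ == "__main__":
    # Hypothetical, deliberately small inputs so the sketch runs quickly
    print(current_disease_frequency(50, 0.0, 20, 0.1, iterations=200, seed=1))
```

The Bernoulli-sum binomial draw keeps the sketch dependency-free but is slow for realistic population sizes; a dedicated binomial sampler would be the natural replacement.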
Boundaries of the current disease allele frequency: the low and high
boundaries (exclusive) of the population frequency of the disease allele must be specified. Only
simulations in which the current frequency of the disease allele lies within this range are retained.
This is intended for analyses that require the current disease frequency to be within a certain range. If no
restriction is needed, simply set the two numbers to 0 and 1.
Genetic data
Marker type: the mutation rate should be between 10^(-9) and 10^(-6)
for SNP markers, and between 10^(-5) and 10^(-2) for STRP markers. The default STRP marker
density is 55.4 for dinucleotide repeats and 11.8 for trinucleotide repeats
(Ramser et al. 2001).
Minimum # markers: if the number of polymorphic markers in a simulated
sample that satisfy the marker polymorphism cutoff level below is less than
the defined minimum # markers, the simulation is discarded. The minimum # markers must be greater than 0.
Minimum allele frequency: the marker polymorphism cutoff level.
Only markers whose minimum allele frequency is greater than or equal to the defined
value are retained. If the minimum allele frequency is set too high and the minimum number of
markers required is also large, the proportion of rejected simulations increases.
The program checks the proportion of rejections, and if it is greater than 95% for the first 100
simulations, the simulation program terminates. In that case, choose more reasonable parameters
so that more realistic simulated samples can be obtained.
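The marker filter and the rejection check can be sketched as follows. This is an illustration only: it interprets the minimum allele frequency as the frequency of the rarest allele observed at a marker (an assumption, which matches the minor allele frequency for biallelic SNPs), and the function names are hypothetical. The haplotypes are the eight chromosomes shown in the XML example in the File output section.

```python
from collections import Counter

# Phased haplotypes (4 normal, then 4 disease) from the File output example
HAPLOTYPES = [
    [1, 1, 1, 1, 2, 1], [1, 1, 2, 2, 1, 1], [1, 1, 2, 2, 1, 1], [1, 1, 2, 1, 1, 1],
    [1, 1, 2, 2, 1, 1], [2, 1, 2, 1, 1, 1], [2, 1, 2, 1, 1, 1], [2, 1, 2, 1, 1, 2],
]


def minimum_allele_frequency(alleles):
    """Frequency of the rarest allele at one marker; 0 if it is monomorphic."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(alleles)


def polymorphic_markers(haplotypes, min_frequency):
    """Indices of markers whose minimum allele frequency is >= min_frequency."""
    columns = zip(*haplotypes)  # one column of alleles per marker
    return [i for i, column in enumerate(columns)
            if minimum_allele_frequency(column) >= min_frequency]


def accept_sample(haplotypes, min_frequency, min_markers):
    """Return the retained marker indices, or None if the sample is discarded."""
    kept = polymorphic_markers(haplotypes, min_frequency)
    return kept if len(kept) >= min_markers else None


def should_terminate(n_rejected, n_simulated):
    """True once more than 95% of the first 100 simulations were rejected."""
    return n_simulated == 100 and n_rejected / n_simulated > 0.95


if __name__ == "__main__":
    print(polymorphic_markers(HAPLOTYPES, 0.2))
    print(accept_sample(HAPLOTYPES, 0.2, 3))
```

With a cutoff of 0.2, only the first and fourth markers of this small sample pass, so requiring three markers would discard it.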
Recombination model
If a homogeneous recombination model is chosen, 1 Mb is assumed to be equivalent to 1 cM.
Otherwise, either the Iceland recombination data or recombination
rates simulated with the geometric Brownian motion (GBM) model are used.
Iceland data: marker names, physical distances,
and map distances for all autosomal chromosomes have been provided by Kong et al. (2001).
Once the chromosome and the left
(From) and right (To) edges (in bp)
of the sampled chromosome interval are specified, the map distances of all markers within
the specified range are used in the simulations. Within each sub-region delimited by
adjacent markers, recombination break points are uniformly distributed.
The low and high boundaries of the physical distances of all markers whose map distances are available in the Icelandic recombination map are displayed in the Region range (bp) box. The left and right edges of the sampled chromosome interval must lie between these two boundaries. The size of the sampled chromosome interval is updated automatically according to the physical locations of its left and right edges.
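Because break points are uniform within each sub-region between adjacent markers, the genetic map is piecewise linear in physical position. The sketch below shows that interpolation; the marker positions and map distances are hypothetical values, not entries from the Icelandic map, and the function names are illustrative.

```python
import bisect

# Hypothetical markers (not taken from the Icelandic map)
MARKER_BP = [1_000_000, 2_000_000, 4_000_000]  # physical positions in bp
MARKER_CM = [0.0, 0.5, 3.5]                    # map positions in cM


def map_position_cm(position_bp, marker_bp, marker_cm):
    """Genetic map position (cM) of a physical position (bp).

    Break points are uniform within each sub-region, so the map position is
    interpolated linearly between the two flanking markers.
    """
    if not marker_bp[0] <= position_bp <= marker_bp[-1]:
        raise ValueError("position lies outside the region covered by the map")
    i = bisect.bisect_right(marker_bp, position_bp) - 1
    if i == len(marker_bp) - 1:  # exactly at the last marker
        return marker_cm[-1]
    fraction = (position_bp - marker_bp[i]) / (marker_bp[i + 1] - marker_bp[i])
    return marker_cm[i] + fraction * (marker_cm[i + 1] - marker_cm[i])


def interval_map_length_cm(left_bp, right_bp, marker_bp, marker_cm):
    """Map length (cM) of the sampled interval between its two edges."""
    return (map_position_cm(right_bp, marker_bp, marker_cm)
            - map_position_cm(left_bp, marker_bp, marker_cm))


if __name__ == "__main__":
    print(interval_map_length_cm(1_500_000, 3_000_000, MARKER_BP, MARKER_CM))
```

Edges outside the range covered by the markers raise an error, mirroring the restriction that the interval must lie within the displayed Region range.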
Output options
If the box for computing the pairwise LDs between the disease mutation and the markers
is checked, the name of the file in which these values are saved needs to be specified in the
Results of LDs are written into the file box. The parameters and all
simulation results are saved into the file whose name is given in the
Output file name box. Both files are in XML format.
Others
Interval size (Mb): if the homogeneous recombination model
or the GBM non-homogeneous model is chosen, the length of the sampled chromosome interval
(in Mb) needs to be specified. Its maximum value is 4 Mb for a population experiencing exponential growth with rate > 0.005,
and 0.5 Mb for a population with an exponential growth rate ≤ 0.005. The maximum value is imposed because of running time
and memory limitations.
Mutation location (Mb): the disease mutation location must be
within the chromosome region, between 0 and the size of the interval. The left edge of the
interval is defined as 0, so the mutation location is the distance from the left edge.
To place the disease mutation to the left of all markers, set the
mutation location to 0; to place it to the right of all markers, set it to the size
of the interval.
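The limits on the interval size and the mutation location can be expressed as a small validation routine. This is a sketch of the rules as stated above for the homogeneous and GBM models, not code from GeneArtisan; the function names are hypothetical.

```python
def max_interval_size_mb(growth_rate):
    """Maximum interval size: 4 Mb if the growth rate > 0.005, otherwise 0.5 Mb."""
    return 4.0 if growth_rate > 0.005 else 0.5


def validate_interval(interval_size_mb, mutation_location_mb, growth_rate):
    """Return a list of problems with the inputs (empty if they are valid)."""
    problems = []
    limit = max_interval_size_mb(growth_rate)
    if not 0 < interval_size_mb <= limit:
        problems.append(f"interval size must be greater than 0 and at most {limit} Mb")
    if not 0 <= mutation_location_mb <= interval_size_mb:
        problems.append("mutation location must lie between 0 and the interval size")
    return problems


if __name__ == "__main__":
    # Hypothetical inputs: a 1 Mb interval with the mutation at its left edge
    print(validate_interval(1.0, 0.0, growth_rate=0.01))
    print(validate_interval(1.0, 0.5, growth_rate=0.001))
```

Returning a list of messages rather than raising on the first failure lets a caller report every invalid field at once.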
Number of simulation replicates: the number of simulated samples to generate
that use all the above input parameters and satisfy all the specified requirements.
Print recombination rates: the recombination rates along the sampled
chromosome interval, either using the Icelandic recombination data, or simulated by the GBM model,
will be printed in the text browser box, if the non-homogeneous recombination model is chosen.
Screen output:
All the inputs entered will be repeated in the output text browser box.
Also for each simulated sample, the following will be displayed:
File output
Simulation results:
All simulation replicates are saved into one .XML file, which can be parsed
by Perl modules such as XML::Parser. The XML file consists of 3 parts: parameters,
simulation results (one <Num_*> element per simulation), and note. All 3 parts are
contained in the root element <DATA>. For example:
<DATA>
  <Parameters>
    ... (parameters that have been used)
  </Parameters>
  <Num_1>    (the first simulated sample)
    <num_loci>6</num_loci>    (number of markers)
    <interval_size>0.224208</interval_size>    (physical distance of the last marker)
    <genotypes_normal>    (phased genotypes of normal chromosomes in the sample)
      <n0a>1 1 1 1 2 1</n0a> \
                               -> (genotypes of one normal individual)
      <n0b>1 1 2 2 1 1</n0b> /
      <n1a>1 1 2 2 1 1</n1a> \
                               -> (genotypes of one normal individual)
      <n1b>1 1 2 1 1 1</n1b> /
      ...
    </genotypes_normal>
    <genotypes_disease>    (phased genotypes of disease chromosomes in the sample)
      <n0a>1 1 2 2 1 1</n0a> \
                               -> (genotypes of one disease individual)
      <n0b>2 1 2 1 1 1</n0b> /
      <n1a>2 1 2 1 1 1</n1a> \
                               -> (genotypes of one disease individual)
      <n1b>2 1 2 1 1 2</n1b> /
      ...
    </genotypes_disease>
    <phy_dis>0.068526 0.071677 0.116376 0.173379 0.209431 0.224208</phy_dis>    (physical distances of markers)
    <dis_frequency>0.168294</dis_frequency>    (current frequency of the disease allele)
    <founding_popsize>635.463</founding_popsize>    (population size at which the disease mutation first arose)
    <TMRCA_diseaseTree>645</TMRCA_diseaseTree>    (TMRCA of the sample of disease chromosomes)
    <TMRCA_all>1084.715</TMRCA_all>    (TMRCA of the entire sample)
  </Num_1>
  <Num_2>    (the second simulated sample)
    ...
  </Num_2>
  ...
  <Note>
    ... (general notes, including # simulations, average # recombinations per genealogy, total running time)
  </Note>
</DATA>
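Results of LDs:
The pairwise LDs between the disease mutation and the markers are saved into a separate XML file,
with one <No*> element per simulated sample inside the root element <DATA>. For example:
<DATA>
  <No1>1 1 0.466667 1 1 1</No1>    (pairwise LDs for the first simulated sample)
  <No2>1 1 1 1 1 1 1 1 1 0.8 1 1 1 0.333333 1</No2>    (pairwise LDs for the second simulated sample)
  ...
</DATA>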
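As an alternative to Perl, the simulation results file can be read with Python's standard xml.etree.ElementTree module. The sketch below parses a shortened string in the layout of the simulation results example; it assumes that the real file is well-formed XML and does not contain the parenthesized annotations, which are explanatory only. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A shortened results file in the documented layout (annotations removed)
SAMPLE_XML = """<DATA>
  <Parameters></Parameters>
  <Num_1>
    <num_loci>6</num_loci>
    <interval_size>0.224208</interval_size>
    <genotypes_normal>
      <n0a>1 1 1 1 2 1</n0a>
      <n0b>1 1 2 2 1 1</n0b>
    </genotypes_normal>
    <genotypes_disease>
      <n0a>1 1 2 2 1 1</n0a>
      <n0b>2 1 2 1 1 1</n0b>
    </genotypes_disease>
    <phy_dis>0.068526 0.071677 0.116376 0.173379 0.209431 0.224208</phy_dis>
    <dis_frequency>0.168294</dis_frequency>
  </Num_1>
  <Note></Note>
</DATA>"""


def read_haplotypes(parent):
    """Map each chromosome tag (n0a, n0b, ...) to its list of alleles."""
    return {child.tag: [int(a) for a in child.text.split()] for child in parent}


def read_samples(root):
    """Collect one dictionary per <Num_*> element under the <DATA> root."""
    samples = []
    for element in root:
        if not element.tag.startswith("Num_"):
            continue  # skip <Parameters> and <Note>
        samples.append({
            "num_loci": int(element.findtext("num_loci")),
            "phy_dis": [float(x) for x in element.findtext("phy_dis").split()],
            "dis_frequency": float(element.findtext("dis_frequency")),
            "normal": read_haplotypes(element.find("genotypes_normal")),
            "disease": read_haplotypes(element.find("genotypes_disease")),
        })
    return samples


samples = read_samples(ET.fromstring(SAMPLE_XML))
print(samples[0]["num_loci"], samples[0]["disease"]["n0b"])
```

For a file on disk, `ET.parse(path).getroot()` returns the same kind of root element, where `path` is whatever name was entered in the Output file name box.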