GeneArtisan Help
GeneArtisan is a computer program for simulating population samples of
disease and normal chromosomes bearing multiple linked genetic markers (SNPs or STRPs)
under a model of mutation, genetic drift, selection and recombination.
The simulation method allows samples to be generated under a coalescent
process with a case-control sampling strategy and polymorphic marker
ascertainment. The method incorporates several features that make the
simulation more realistic: selection acting on
disease alleles; sample ascertainment of disease chromosomes and polymorphic markers;
a genetic dominance model of disease expression that allows incomplete penetrance and phenocopies;
and an accurate genetic map of recombination rates and recombination hotspots in the human genome
(or, alternatively, an improved method for simulating the distribution of hotspots).
References:
Population data
Number of disease individuals: the number of
disease individuals in the sample. If it is set to 0, a standard simulation is
carried out without the case-control sampling strategy.
Number of normal individuals: the number of
normal individuals in the sample. It must be greater than 0.
Data type: if phased genotype data is chosen,
the total number of disease and normal chromosomes is twice the
sum of the two numbers above. The number of individuals in each of the two categories (cases and controls)
is simulated using the disease penetrance-sampling model with the specified
penetrance parameters.
Population demography
Population size: the diploid population size in the current generation.
Growth rate: the population size is assumed to have grown
exponentially at the given rate. Set the growth rate to 0 for a constant-size population.
Disease mutation age: the time (in generations) in the past at which the disease
mutation first arose. Ensure that the population size was not too small at that time, given
the present-day population size and the growth rate. The population size at the time
the disease mutation first arose must be at least 2.
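The check on the population size at the time of the mutation can be sketched as follows. This is only an illustration: it assumes continuous exponential growth of the form N(t) = N_now * exp(-r * t), which the help text does not state explicitly, and the function names are hypothetical rather than part of GeneArtisan.

```python
import math

def founding_population_size(current_size, growth_rate, mutation_age):
    """Population size `mutation_age` generations ago.

    Assumption (not confirmed by the help text): continuous exponential
    growth, N(t) = N_now * exp(-growth_rate * t).
    """
    return current_size * math.exp(-growth_rate * mutation_age)

def max_mutation_age(current_size, growth_rate, min_size=2.0):
    """Largest mutation age that keeps the founding size at or above min_size."""
    if growth_rate == 0:
        return math.inf  # constant-size population: the size never shrinks
    return math.log(current_size / min_size) / growth_rate

if __name__ == "__main__":
    n0 = founding_population_size(1_000_000, 0.01, 500)
    print(f"founding size: {n0:.1f}")
    print(f"oldest allowed age: {max_mutation_age(1_000_000, 0.01):.0f} generations")
```

If the computed founding size falls below 2, either a younger mutation age, a larger present-day population size, or a lower growth rate is needed.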
Selection coefficient: the selection coefficient of the disease
allele. Its value is not restricted.
Current disease frequency button: displays the
expected population frequency of the disease allele in the current generation, together with its variance.
These are obtained by simulating the sample path (over time) of the population frequency of the disease
allele for 1000 iterations, conditional on non-extinction and non-fixation. The initial frequency of
the disease allele when it first arose is set to 1 over the population size at that time.
The disease mutation age can then be adjusted to obtain a desired population frequency
of the disease allele in the present-day generation. If the disease allele is fixed or lost
in more than 98% of the first 1000 simulated sample paths,
the program terminates. The 4 parameters above must then be modified before
the simulation program can be run.
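A minimal sketch of this kind of sample-path simulation is given below. It is not GeneArtisan's implementation: it assumes a Wright-Fisher model with genic selection, continuous exponential growth, and binomial resampling of 2N chromosomes each generation, and all names are hypothetical. The initial frequency and the 98% termination rule follow the description above.

```python
import math
import random

def binomial(n, p, rng):
    """Draw from Binomial(n, p) by summing Bernoulli trials (adequate for small n)."""
    return sum(rng.random() < p for _ in range(n))

def simulate_current_frequency(current_size, growth_rate, mutation_age,
                               selection, iterations=1000, seed=None):
    """Return (mean, variance, rejected_fraction) of the present-day frequency,
    using only paths in which the allele is neither lost nor fixed."""
    rng = random.Random(seed)
    kept = []
    for _ in range(iterations):
        size = current_size * math.exp(-growth_rate * mutation_age)
        p = 1.0 / size  # initial frequency: 1 over the founding population size
        for gen in range(1, mutation_age + 1):
            size = current_size * math.exp(-growth_rate * (mutation_age - gen))
            chromosomes = max(2, round(2 * size))
            # Assumed genic selection: the allele has relative fitness 1 + s.
            expected = p * (1 + selection) / (1 + selection * p)
            p = binomial(chromosomes, expected, rng) / chromosomes
            if p == 0.0 or p == 1.0:
                break  # lost or fixed: this path is rejected
        if 0.0 < p < 1.0:
            kept.append(p)
    rejected = 1 - len(kept) / iterations
    if rejected > 0.98:
        raise ValueError("allele lost or fixed in more than 98% of paths; "
                         "change the demographic or selection parameters")
    mean = sum(kept) / len(kept)
    variance = sum((x - mean) ** 2 for x in kept) / len(kept)
    return mean, variance, rejected

if __name__ == "__main__":
    print(simulate_current_frequency(100, 0.0, 30, 0.1, iterations=200, seed=1))
```

The Bernoulli-sum binomial sampler keeps the sketch dependency-free but is slow for large populations; a dedicated binomial generator would be the natural replacement.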
Boundaries of the current disease allele frequency: the low and high
boundaries (exclusive) of the population frequency of the disease allele must be given. Only
simulations in which the current frequency of the disease allele lies within this range are retained.
This is intended for analyses that require the current disease frequency to be within a certain range. If no
restriction is needed, simply set the two numbers to 0 and 1.
Genetic data
Marker type: the mutation rate is between 10^(-9) and 10^(-6)
for SNP markers, and between 10^(-5) and 10^(-2) for STRP markers. The default STRP marker
density is 55.4 for dinucleotide repeats and 11.8 for trinucleotide repeats
(Ramser et al. 2001).
Minimum # markers: a simulated sample is discarded if the number of
polymorphic markers that satisfy the marker polymorphism cutoff level described below is less than
the defined minimum # markers. The minimum # markers must be greater than 0.
Minimum allele frequency: the marker polymorphism cutoff level.
Only markers whose minimum allele frequency is greater than or equal to the defined
value are retained. If the minimum allele frequency is too high and the minimum number of
markers required is also large, the proportion of rejected simulations increases.
The program checks the proportion of rejections, and if it is greater than 95% for the first 100
simulations, the simulation program terminates. More reasonable parameters should then be chosen
so that realistic simulated samples can be obtained.
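The marker ascertainment step described above can be illustrated with a short sketch. It assumes that the polymorphism level of a marker is 1 minus the frequency of its most common allele (which equals the minor allele frequency for a biallelic SNP); the help text does not define the measure for multi-allelic markers, and the function names are hypothetical. The haplotypes are the eight chromosomes shown in the example output file.

```python
from collections import Counter

def polymorphism_level(alleles):
    """Assumed measure: 1 minus the frequency of the most common allele."""
    most_common_count = Counter(alleles).most_common(1)[0][1]
    return 1.0 - most_common_count / len(alleles)

def ascertain_markers(haplotypes, min_allele_freq, min_markers):
    """Return the indices of retained markers, or None if the sample is discarded."""
    n_markers = len(haplotypes[0])
    kept = [j for j in range(n_markers)
            if polymorphism_level([h[j] for h in haplotypes]) >= min_allele_freq]
    return kept if len(kept) >= min_markers else None

# Normal and disease chromosomes taken from the example output file.
HAPLOTYPES = [
    [1, 1, 1, 1, 2, 1], [1, 1, 2, 2, 1, 1], [1, 1, 2, 2, 1, 1], [1, 1, 2, 1, 1, 1],
    [1, 1, 2, 2, 1, 1], [2, 1, 2, 1, 1, 1], [2, 1, 2, 1, 1, 1], [2, 1, 2, 1, 1, 2],
]

if __name__ == "__main__":
    print(ascertain_markers(HAPLOTYPES, 0.2, 2))  # sample kept
    print(ascertain_markers(HAPLOTYPES, 0.2, 3))  # sample discarded
```

Raising either threshold increases the chance that `ascertain_markers` returns None, which corresponds to the higher rejection proportion mentioned above.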
Recombination model
If a homogeneous recombination model is chosen, 1 Mb is assumed to be equivalent to 1 cM.
Otherwise, either the Iceland recombination data or recombination
rates simulated with the geometric Brownian motion (GBM) model are used.
Iceland data: the names, physical distances,
and map distances of the available markers on all autosomal chromosomes were provided by Kong et al (2001).
The physical distances of the markers used in the simulations are obtained by automatically
searching the UCSC Genome Browser on the Human May 2004 Assembly. After the chromosome and the left
(From) and right (To) edges (in bps)
of the sampled chromosome interval are specified, the map distances of all markers within
the specified range are used in the simulations. In each sub-region bounded by two
adjacent markers, the recombination break points are uniformly distributed.
The low and high boundaries of the physical distances of all markers whose map distances are available in the Icelandic recombination map are displayed in the Region range (bp) box. Only positions between these two boundaries can be entered as the left and right edges of the sampled chromosome interval. The size of the sampled chromosome interval is updated automatically according to the physical locations of its left and right edges.
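The idea that break points are uniform within each sub-region between adjacent markers can be sketched as below. This is an illustration under assumptions, not GeneArtisan's code: a sub-region is chosen with probability proportional to its map length, then a position is drawn uniformly inside it. The marker positions and map distances in the usage example are made-up numbers, and the function name is hypothetical.

```python
import bisect
import random

def sample_breakpoint(positions_bp, map_cm, rng):
    """Draw one recombination break point (in bp).

    positions_bp: sorted physical positions of the markers.
    map_cm: cumulative map positions (cM) of the same markers.
    """
    total = map_cm[-1] - map_cm[0]
    target = map_cm[0] + rng.random() * total
    # Index of the sub-region whose map interval contains the target.
    i = bisect.bisect_right(map_cm, target) - 1
    i = min(max(i, 0), len(map_cm) - 2)
    left, right = positions_bp[i], positions_bp[i + 1]
    # Uniform position inside the chosen sub-region.
    return left + rng.random() * (right - left)

if __name__ == "__main__":
    rng = random.Random(0)
    positions = [0, 1000, 5000]   # hypothetical marker positions (bp)
    genetic_map = [0.0, 0.1, 0.2]  # hypothetical cumulative map (cM)
    print([round(sample_breakpoint(positions, genetic_map, rng)) for _ in range(5)])
```

In the usage example both sub-regions have the same map length, so about half of the break points fall in the short first sub-region even though it covers only a fifth of the physical interval.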
Output options
If the box for computing the pairwise LDs between the disease mutation and the markers
is checked, the name of the file in which these values are saved needs to be given in the
Results of LDs are written into the file box. The parameters and all
simulation results are saved into the file whose name the user enters in the
Output file name box. Both files are in XML format.
Others
Interval size (Mb): if the homogeneous recombination model
or the GBM non-homogeneous model is chosen, the length of the sampled chromosome interval
(in Mb) needs to be specified. Its maximum value is 6 Mb for an exponentially growing population
and 0.05 Mb for a constant-size population. The maximum is imposed because of limitations on running time
and memory. Many factors affect the running time of the program.
The time needed to simulate the genealogy of a sample is strongly influenced by
the current population size and especially by the population growth rate.
For example, if the current population size is 1000 and the growth rate
is 0, the expected time to the most recent common ancestor of a sample of 2 normal chromosomes with
an interval size of 1 Mb is 4.852*10^(9) generations.
Mutation location (Mb): the disease mutation location must be
within the chromosome region, between 0 and the size of the interval. The left edge of the
interval is defined as 0, so the mutation location is the distance from the left edge.
To place the disease mutation to the left of all markers, set the
mutation location to 0. To place it to the right of all markers, set it equal to
the size of the interval.
A mutation location beyond the interval is not allowed because the only difference
in that case is that the markers in the sub-region bounded by the mutation location and the left, or right,
edge are discarded.
Number of simulation replicates: the number of simulated samples
that are generated with all the input parameters above and that satisfy all the specified requirements.
Print recombination rates: the recombination rates along the sampled
chromosome interval, either using the Icelandic recombination data, or simulated by the GBM model,
will be printed in the text browser box, if the non-homogeneous recombination model is chosen.
Screen output:
All the inputs entered will be repeated in the output text browser box.
And for each simulated sample, the following will be displayed:
File output
Simulation results:
All simulation replicates are saved into one .XML file, which can be parsed
with Perl modules such as XML::Parser. The XML file consists of 3 parts: parameters,
simulation results (one <Num_*> element for each simulation), and note. All 3 parts are
contained in the root element <DATA>. For example:
<DATA>
  <Parameters>
    ... (Parameters that have been inputted by users)
  </Parameters>
  <Num_1> (the first simulated sample)
    <num_loci>6</num_loci> (number of markers)
    <interval_size>0.224208</interval_size> (physical distance of the last marker)
    <genotypes_normal> (phased genotypes of normal chromosomes in the sample)
      <n0a>1 1 1 1 2 1</n0a> \
                              -> (genotypes of one normal individual)
      <n0b>1 1 2 2 1 1</n0b> /
      <n1a>1 1 2 2 1 1</n1a> \
                              -> (genotypes of one normal individual)
      <n1b>1 1 2 1 1 1</n1b> /
      ...
    </genotypes_normal>
    <genotypes_disease> (phased genotypes of disease chromosomes in the sample)
      <n0a>1 1 2 2 1 1</n0a> \
                              -> (genotypes of one disease individual)
      <n0b>2 1 2 1 1 1</n0b> /
      <n1a>2 1 2 1 1 1</n1a> \
                              -> (genotypes of one disease individual)
      <n1b>2 1 2 1 1 2</n1b> /
      ...
    </genotypes_disease>
    <phy_dis>0.068526 0.071677 0.116376 0.173379 0.209431 0.224208</phy_dis> (physical distances of markers)
    <dis_frequency>0.168294</dis_frequency> (current frequency of the disease allele)
    <founding_popsize>635.463</founding_popsize> (population size at which the disease mutation first arose)
    <TMRCA_diseaseTree>645</TMRCA_diseaseTree> (TMRCA of the sample of disease chromosomes)
    <TMRCA_all>1084.715</TMRCA_all> (TMRCA of the entire sample)
  </Num_1>
  <Num_2> (the second simulated sample)
    ...
  </Num_2>
  ...
  <Note>
    ... (general notes, including # simulations, average # recombinations per genealogy, total running time)
  </Note>
</DATA>
Results of LDs:
<DATA>
  <No1>1 1 0.466667 1 1 1</No1> (Pairwise LDs for the first simulated sample)
  <No2>1 1 1 1 1 1 1 1 1 0.8 1 1 1 0.333333 1</No2> (Pairwise LDs for the second simulated sample)
  ...
</DATA>
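As an alternative to Perl, the simulation results file can be read with any XML library. The sketch below uses Python's standard xml.etree.ElementTree module and assumes that the real file is well-formed and does not contain the parenthesized annotations shown in the example. The embedded SAMPLE string is a shortened copy of that example, and the function name and dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, annotation-free copy of the example output shown above.
SAMPLE = """<DATA>
  <Parameters></Parameters>
  <Num_1>
    <num_loci>6</num_loci>
    <genotypes_normal>
      <n0a>1 1 1 1 2 1</n0a>
      <n0b>1 1 2 2 1 1</n0b>
    </genotypes_normal>
    <genotypes_disease>
      <n0a>1 1 2 2 1 1</n0a>
      <n0b>2 1 2 1 1 1</n0b>
    </genotypes_disease>
    <phy_dis>0.068526 0.071677 0.116376 0.173379 0.209431 0.224208</phy_dis>
    <dis_frequency>0.168294</dis_frequency>
  </Num_1>
  <Note></Note>
</DATA>"""

def read_haplotypes(replicate, group):
    """Return the haplotypes under <group> as lists of integer alleles."""
    parent = replicate.find(group)
    if parent is None:
        return []
    return [[int(a) for a in hap.text.split()] for hap in parent]

def read_replicates(xml_text):
    """Collect every <Num_*> element of the root <DATA> element."""
    root = ET.fromstring(xml_text)
    replicates = []
    for node in root:
        if not node.tag.startswith("Num_"):
            continue  # skip <Parameters> and <Note>
        replicates.append({
            "num_loci": int(node.findtext("num_loci")),
            "normal": read_haplotypes(node, "genotypes_normal"),
            "disease": read_haplotypes(node, "genotypes_disease"),
            "positions": [float(x) for x in node.findtext("phy_dis").split()],
            "disease_frequency": float(node.findtext("dis_frequency")),
        })
    return replicates

if __name__ == "__main__":
    for rep in read_replicates(SAMPLE):
        print(rep["num_loci"], rep["disease_frequency"], len(rep["normal"]))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.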