Browse GeneArtisan Help




Overview



GeneArtisan is a computer program for simulating population samples of disease and normal chromosomes bearing multiple linked genetic markers (SNPs or STRPs) under a model of mutation, genetic drift, selection and recombination. Samples are generated under a coalescent process with a case-control sampling strategy and polymorphic marker ascertainment. Several features make the simulation more realistic: selection acting on disease alleles; sample ascertainment of disease chromosomes and polymorphic markers; a genetic dominance model of disease expression that allows incomplete penetrance and phenocopies; and an accurate genetic map of recombination rates and recombination hotspots in the human genome (or, alternatively, an improved method for simulating the distribution of hotspots).





Input



Population data

Number of disease individuals: the number of disease individuals in the sample. If it is set to 0, an ordinary simulation is carried out without the case-control sampling strategy.

Number of normal individuals: the number of normal individuals in the sample. It must be greater than 0.

Data type: if phased genotype data is chosen, the total number of disease and normal chromosomes is twice the sum of the two numbers above. The number of individuals in each of the two categories (cases and controls) is then simulated using the disease penetrance-sampling model with the specified penetrance parameters.



Population demography

Population size: the diploid population size in the current generation.

Growth rate: the population size is assumed to have grown exponentially at the given rate. Set the growth rate to 0 for a constant-size population.

Disease mutation age: the time (in generations before the present) at which the disease mutation first arose. Make sure the population size is not too small at this time, given the present-day population size and the growth rate: the population size at the time the disease mutation first arose must be at least 2.
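
As a rough check on this constraint, the population size at the time the mutation arose can be estimated from the present-day size, the growth rate and the mutation age. The sketch below assumes continuous exponential growth, N(t) = N0 * exp(-r * t) going back t generations. The exact growth formula used by GeneArtisan is not stated in this help page, and the function names and input values are illustrative only.

```python
import math

def founding_population_size(current_size, growth_rate, mutation_age):
    """Population size `mutation_age` generations ago, assuming continuous
    exponential growth (an assumption, not necessarily GeneArtisan's formula)."""
    return current_size * math.exp(-growth_rate * mutation_age)

def mutation_age_is_valid(current_size, growth_rate, mutation_age, minimum=2.0):
    """True if the founding population size reaches the documented minimum of 2."""
    return founding_population_size(current_size, growth_rate, mutation_age) >= minimum

# Hypothetical inputs: 1,000,000 diploid individuals today, growth rate 0.01.
print(founding_population_size(1_000_000, 0.01, 500))  # roughly 6738
print(mutation_age_is_valid(1_000_000, 0.01, 500))     # True
print(mutation_age_is_valid(1_000_000, 0.01, 1500))    # False: founding size below 2
```

With a growth rate of 0 the founding size equals the current size, so any mutation age passes this check.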

Selection coefficient: the selection coefficient of the disease allele. Its value is not restricted.

Current disease frequency button: reports the expected population frequency of the disease allele in the current generation, together with its variance. The values are obtained by simulating the sample path (over time) of the population frequency of the disease allele for 1000 iterations, conditional on non-extinction and non-fixation. The initial frequency of the disease allele when it first arose is set to 1 over the population size at that time. The disease mutation age can then be adjusted to obtain a desired present-day population frequency of the disease allele. If the disease allele was fixed or lost in more than 0.98 of the first 1000 simulated sample paths, the program terminates, and the four parameters above must be modified before the simulation can be run again.
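
The idea behind this check can be sketched as follows. This is a minimal illustration, not GeneArtisan's actual algorithm: it assumes a Wright-Fisher model with genic selection, counts 2N chromosomes per generation, starts from a single mutant copy, and lets the population grow exponentially. All names and input values are hypothetical.

```python
import math
import random
import statistics

def simulate_current_frequency(current_size, growth_rate, mutation_age,
                               selection, iterations=1000, seed=None):
    """Return (mean, variance, rejected proportion) of the present-day
    disease allele frequency over paths that were neither lost nor fixed."""
    rng = random.Random(seed)

    def chromosomes(generations_ago):
        # Number of chromosomes (2N) under assumed exponential growth.
        size = current_size * math.exp(-growth_rate * generations_ago)
        return max(2, round(2 * size))

    retained = []
    for _ in range(iterations):
        n_chrom = chromosomes(mutation_age)
        copies = 1  # the mutation arises as a single copy
        for generations_ago in range(mutation_age - 1, -1, -1):
            p = copies / n_chrom
            # Genic selection shifts the expected frequency before sampling.
            p_sel = p * (1 + selection) / (1 + selection * p)
            n_chrom = chromosomes(generations_ago)
            # Binomial sampling of the next generation (genetic drift).
            copies = sum(rng.random() < p_sel for _ in range(n_chrom))
            if copies == 0 or copies == n_chrom:
                break  # allele lost or fixed
        if 0 < copies < n_chrom:
            retained.append(copies / n_chrom)

    rejected = 1 - len(retained) / iterations
    mean = statistics.fmean(retained) if retained else float("nan")
    var = statistics.variance(retained) if len(retained) > 1 else float("nan")
    return mean, var, rejected

# Hypothetical small example so that the pure-Python loop stays fast.
mean, var, rejected = simulate_current_frequency(100, 0.02, 20, 0.1, seed=1)
print(mean, var, rejected)
if rejected > 0.98:
    print("Too many paths lost or fixed: adjust the four parameters above.")
```

Most new mutations are lost within a few generations, so the rejected proportion is typically large. This is why the help text suggests tuning the mutation age and the other parameters until the expected frequency is acceptable.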

Boundaries of the current disease allele frequency: the low and high boundaries (exclusive) of the population frequency of the disease allele. Only simulations in which the current frequency of the disease allele lies within this range are retained. This is useful for analyses that require the current disease frequency to fall within a certain range. If no restriction is needed, simply set the two numbers to 0 and 1.



Genetic data

Marker type: the mutation rate is between 10^(-9) and 10^(-6) for SNP markers, and between 10^(-5) and 10^(-2) for STRP markers. The default STRP marker density is 55.4 for dinucleotide repeats and 11.8 for trinucleotide repeats (Ramser, et al 2001).

Minimum # markers: if the number of polymorphic markers in a simulated sample that satisfy the marker polymorphism cutoff level (see below) is less than the minimum # markers, the simulation is discarded. The minimum # markers must be greater than 0.

Minimum allele frequency: the marker polymorphism cutoff level. Only markers whose minimum allele frequency is greater than or equal to this value are retained. If the minimum allele frequency is high and the minimum number of markers required is also large, the proportion of rejected simulations increases. The program checks the proportion of rejections, and terminates if it is greater than 95% for the first 100 simulations. In that case, choose more reasonable parameter values so that realistic simulated samples can be obtained.
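
The interaction of the two marker settings can be illustrated with a short sketch. It assumes biallelic markers coded 1 and 2 (as in the XML output described later) and takes the frequency of the rarer allele as the "minimum allele frequency". The function names are hypothetical and this is not GeneArtisan's own code.

```python
from collections import Counter

def minor_allele_frequency(column):
    """Frequency of the rarest allele at one marker (0.0 if monomorphic)."""
    counts = Counter(column)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(column)

def polymorphic_markers(haplotypes, min_allele_freq):
    """Indices of markers that satisfy the polymorphism cutoff level."""
    n_markers = len(haplotypes[0])
    return [j for j in range(n_markers)
            if minor_allele_frequency([h[j] for h in haplotypes]) >= min_allele_freq]

def accept_sample(haplotypes, min_allele_freq, min_markers):
    """A sample is kept only if enough markers pass the cutoff."""
    return len(polymorphic_markers(haplotypes, min_allele_freq)) >= min_markers

# Four chromosomes, six markers (values taken from the XML example in this help).
haplotypes = [
    [1, 1, 1, 1, 2, 1],
    [1, 1, 2, 2, 1, 1],
    [1, 1, 2, 2, 1, 1],
    [1, 1, 2, 1, 1, 1],
]
print(polymorphic_markers(haplotypes, 0.25))  # [2, 3, 4]
print(accept_sample(haplotypes, 0.25, 3))     # True
print(accept_sample(haplotypes, 0.30, 2))     # False: only marker 3 passes
```

Raising either threshold shrinks the set of accepted samples, which is the effect the rejection check above guards against.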



Recombination model

If a homogeneous recombination model is chosen, 1 Mb is assumed to be equivalent to 1 cM. Otherwise, either the Iceland recombination data or recombination rates simulated with the geometric Brownian motion (GBM) model are used.

Iceland data: the names, physical distances and map distances of the available markers on all autosomal chromosomes are provided by Kong et al (2002). The physical distances of the markers used in the simulations are obtained by automatically searching the UCSC Genome Browser, Human May 2004 Assembly. Once the chromosome and the left (From) and right (To) edges (in bp) of the sampled chromosome interval are specified, the map distances of all markers within the specified range are used in the simulations. Within each sub-region delimited by adjacent markers, recombination break points are uniformly distributed.

   The low and high boundaries of the physical distances of all markers whose map distances are available in the Icelandic recombination map are displayed in the Region range (bp) box. The left and right edges of the sampled chromosome interval must lie between these two boundaries. The size of the sampled chromosome interval is updated automatically according to the physical locations of its left and right edges.
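
The piecewise-uniform placement of break points described above can be sketched as follows: a position is drawn uniformly on the genetic map and converted back to a physical position by linear interpolation between adjacent markers. The function name and marker values are hypothetical, and this is an illustration rather than GeneArtisan's implementation.

```python
import bisect
import random

def sample_breakpoint(phys_pos, map_pos, rng=random):
    """Draw one recombination break point (in bp).

    phys_pos: increasing marker positions in bp.
    map_pos:  cumulative map distances in cM for the same markers.
    """
    total = map_pos[-1] - map_pos[0]
    if total <= 0:
        raise ValueError("the interval has no genetic length")
    # Uniform on the genetic map: sub-regions are hit in proportion to their cM length.
    target = map_pos[0] + rng.random() * total
    i = bisect.bisect_right(map_pos, target) - 1
    i = min(max(i, 0), len(map_pos) - 2)
    width = map_pos[i + 1] - map_pos[i]
    if width == 0:
        return float(phys_pos[i + 1])
    # Linear interpolation makes the break point uniform within the sub-region.
    frac = (target - map_pos[i]) / width
    return phys_pos[i] + frac * (phys_pos[i + 1] - phys_pos[i])

# Hypothetical markers: the second sub-region is a hotspot (more cM per bp).
phys = [1_000_000, 1_200_000, 1_250_000, 1_600_000]
cm = [0.0, 0.1, 0.6, 0.8]
rng = random.Random(42)
points = [sample_breakpoint(phys, cm, rng) for _ in range(5)]
print(points)
```

Because the draw is made in cM, a short sub-region with a large map distance receives proportionally more break points than a long one with a small map distance.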


GBM: the distribution of non-homogeneous recombination rates along a chromosomal region is simulated using the GBM model, with the recombination rate evolving along the length of the chromosome. The default drift and diffusion parameters of the GBM are the averages of the maximum likelihood estimates over all autosomal chromosomes (pooling females and males), calculated from the data of Kong et al (2002).
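
A geometric Brownian motion along the chromosome can be sketched with the standard exact discretization, R(x + dx) = R(x) * exp((mu - sigma^2 / 2) * dx + sigma * sqrt(dx) * Z), where Z is a standard normal draw. The drift, diffusion, starting rate and step count below are placeholders, not GeneArtisan's defaults, and the function names are hypothetical.

```python
import math
import random

def simulate_gbm_rates(interval_mb, n_steps, drift, diffusion,
                       initial_rate=1.0, seed=None):
    """Recombination rates (cM/Mb) at n_steps + 1 equally spaced positions."""
    rng = random.Random(seed)
    dx = interval_mb / n_steps
    rates = [initial_rate]
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        step = (drift - 0.5 * diffusion ** 2) * dx + diffusion * math.sqrt(dx) * z
        rates.append(rates[-1] * math.exp(step))  # rates stay strictly positive
    return rates

def total_map_length(rates, interval_mb):
    """Map length (cM) of the interval by trapezoidal integration of the rates."""
    dx = interval_mb / (len(rates) - 1)
    return sum(0.5 * (a + b) * dx for a, b in zip(rates, rates[1:]))

# Placeholder parameters for a 1 Mb interval split into 100 steps.
rates = simulate_gbm_rates(1.0, 100, drift=0.0, diffusion=1.5, seed=3)
print(min(rates), max(rates), total_map_length(rates, 1.0))
```

With the diffusion set to 0 and the drift set to 0 the rate stays constant, which reduces to the homogeneous model of 1 cM per Mb when the starting rate is 1.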



Output options

If the box for computing the pairwise LDs between the disease mutation and the markers is checked, the name of the file in which these values are saved must be given in the "Results of LDs are written into the file" box. The parameters and all simulation results are saved to the file whose name is entered in the Output file name box. Both files are in XML format.



Others

Interval size (Mb): if the homogeneous recombination model or the GBM non-homogeneous model is chosen, the length of the sampled chromosome interval (in Mb) must be specified. Its maximum value is 6 Mb for an exponentially growing population and 0.05 Mb for a constant-size population. The maximum is imposed because of running time and memory limitations. Many factors affect the running time of the program. The time needed to simulate the genealogy of a sample is strongly influenced by the current population size and, especially, by the population growth rate. For example, if the current population size is 1000 and the growth rate is 0, the expected time to the most recent common ancestor of a sample of 2 normal chromosomes with an interval size of 1 Mb is 4.852*10^(9) generations.

Mutation location (Mb): the disease mutation must lie within the chromosome region, between 0 and the size of the interval. The left edge of the interval is defined as 0, so the mutation location is the distance from the left edge. To place the disease mutation to the left of all markers, set the mutation location to 0. To place it to the right of all markers, set it equal to the size of the interval.

A mutation location outside the interval is not allowed because it is not needed: the only difference in that case would be that the markers in the sub-region between the mutation location and the left (or right) edge are discarded.

Number of simulation replicates: the number of simulated samples to generate using all of the above input parameters and satisfying all of the specified requirements.

Print recombination rates: if a non-homogeneous recombination model is chosen, the recombination rates along the sampled chromosome interval, either taken from the Icelandic recombination data or simulated by the GBM model, are printed in the text browser box.



Output


Screen output:

All of the inputs entered are repeated in the output text browser box. For each simulated sample, the following are displayed:

  1. The actual number of disease and normal chromosomes in the sample simulated by the penetrance-sampling model, if genotype data is chosen.
  2. The time to the most recent common ancestor of the sample.
  3. The time to the most recent common ancestor of the sample of disease chromosomes.
  4. The number of recombinations that occurred in the history of the sample.
  5. The number of recombinations that occurred in the history of the sample of disease chromosomes.
  6. The number of mutations that occurred in the history of the sample.
  7. The number of markers that satisfy the polymorphism cutoff level.
  8. The current population frequency of the disease allele.

After all of the simulation replicates have been generated, the averages of items 2-8 over these simulations are computed and displayed. The proportion of simulated samples rejected because of the defined restrictions, such as the minimum # markers and the minimum allele frequency, is displayed as well.

Once the simulation has finished, and provided the number of disease individuals in the sample is greater than 0, the simulation results can be converted to the input file format of the DMLE program (http://www.dmle.org). All of the files are saved in the directory specified in the Dir/File name box. Each simulated sample produces a single DMLE input file, named after the directory with the index number of the simulated sample appended.

Number of markers: the number of markers that will appear in the DMLE input files. These markers are chosen randomly from the original markers of each simulated sample. This makes samples comparable for analyses that require all samples to have the same number of markers. The number must be less than or equal to the smallest number of markers among all simulation replicates displayed in the text browser box.
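
The random marker selection can be pictured with the sketch below. It assumes the chosen markers are kept in their original order along the chromosome, which this help page does not state explicitly, and the function names are hypothetical.

```python
import random

def choose_markers(n_available, n_wanted, seed=None):
    """Randomly pick n_wanted marker indices out of n_available, kept in map order."""
    if n_wanted > n_available:
        raise ValueError("cannot select more markers than the sample contains")
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_available), n_wanted))

def subset_haplotype(haplotype, marker_indices):
    """Keep only the alleles at the selected markers."""
    return [haplotype[j] for j in marker_indices]

# Example: reduce a six-marker chromosome to four markers.
indices = choose_markers(6, 4, seed=11)
print(indices)
print(subset_haplotype([1, 1, 2, 2, 1, 1], indices))
```

Requesting more markers than the smallest replicate contains raises an error here, mirroring the upper limit stated above.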



File output

Simulation results:

All simulation replicates are saved into one .xml file, which can be parsed with Perl modules such as XML::Parser. The XML file consists of 3 parts: the parameters, the simulation results (one <Num_*> element per simulation), and a note. All 3 parts are contained in the root element <DATA>. For example:

<DATA>
	<Parameters>
	  ... (Parameters that have been inputted by users)
   	</Parameters>
	
	<Num_1>		(the first simulated sample)
		<num_loci>6</num_loci>  (number of markers) 
		<interval_size>0.224208</interval_size> (physical distance of the last marker)
		<genotypes_normal>  (phased genotypes of normal chromosomes in the sample) 
		<n0a>1 1 1 1 2 1</n0a> \ -> (genotypes of one normal individual)
		<n0b>1 1 2 2 1 1</n0b> /
		<n1a>1 1 2 2 1 1</n1a> \ -> (genotypes of one normal individual)
		<n1b>1 1 2 1 1 1</n1b> /
			...
		</genotypes_normal>		
		<genotypes_disease> (phased genotypes of disease chromosomes in the sample)
		<n0a>1 1 2 2 1 1</n0a> \ -> (genotypes of one disease individual)
		<n0b>2 1 2 1 1 1</n0b> /
		<n1a>2 1 2 1 1 1</n1a> \ -> (genotypes of one disease individual)
		<n1b>2 1 2 1 1 2</n1b> /
			...
		</genotypes_disease>
		<phy_dis>0.068526   0.071677	0.116376	0.173379	0.209431	0.224208</phy_dis> (physical distances of markers)
		<dis_frequency>0.168294</dis_frequency> (current frequency of the disease allele)
		<founding_popsize>635.463</founding_popsize> (population size at which the disease mutation first arose)
		<TMRCA_diseaseTree>645</TMRCA_diseaseTree> (TMRCA of the sample of disease chromosomes)
		<TMRCA_all>1084.715</TMRCA_all> (TMRCA of the entire sample)
  	</Num_1>
	
	<Num_2> (the second simulated sample) ... </Num_2>
	...
	<Note> ... (general notes, including # simulations, average # recombinations per genealogy, total running time) </Note>
</DATA>
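
As an alternative to Perl, the results file can be read with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the example above, without the explanatory annotations. The element names come from that example; the real file may contain further elements, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<DATA>
  <Parameters></Parameters>
  <Num_1>
    <num_loci>6</num_loci>
    <interval_size>0.224208</interval_size>
    <genotypes_normal>
      <n0a>1 1 1 1 2 1</n0a>
      <n0b>1 1 2 2 1 1</n0b>
    </genotypes_normal>
    <genotypes_disease>
      <n0a>1 1 2 2 1 1</n0a>
      <n0b>2 1 2 1 1 1</n0b>
    </genotypes_disease>
    <phy_dis>0.068526 0.071677 0.116376 0.173379 0.209431 0.224208</phy_dis>
    <dis_frequency>0.168294</dis_frequency>
  </Num_1>
  <Note></Note>
</DATA>"""

def read_haplotypes(parent):
    """Map each chromosome tag (n0a, n0b, ...) to its list of alleles."""
    if parent is None:  # e.g. no disease individuals in the sample
        return {}
    return {child.tag: [int(a) for a in child.text.split()] for child in parent}

def parse_results(xml_text):
    """Return one dictionary per <Num_*> element."""
    root = ET.fromstring(xml_text)
    samples = []
    for elem in root:
        if not elem.tag.startswith("Num_"):
            continue  # skip <Parameters> and <Note>
        samples.append({
            "index": int(elem.tag[len("Num_"):]),
            "num_loci": int(elem.findtext("num_loci")),
            "positions": [float(x) for x in elem.findtext("phy_dis").split()],
            "dis_frequency": float(elem.findtext("dis_frequency")),
            "normal": read_haplotypes(elem.find("genotypes_normal")),
            "disease": read_haplotypes(elem.find("genotypes_disease")),
        })
    return samples

samples = parse_results(SAMPLE)
print(samples[0]["num_loci"], samples[0]["normal"]["n0a"])
```

To read a file produced by the program, pass its contents to parse_results, for example parse_results(open(path).read()), where path is the name entered in the Output file name box.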


Pairwise LDs between the disease mutation and markers:

The pairwise LDs between the disease mutation and the markers, measured by |D'|, are saved into an .xml file as well. All data are contained in the root element <DATA>, with one <No*> element per simulated sample. For example:
<DATA>
	<No1>1 1 0.466667 1 1 1</No1> (Pairwise LDs for the first simulated sample)
	<No2>1 1 1 1 1 1 1 1 1 0.8 1 1 1 0.333333 1</No2> (Pairwise LDs for the second simulated sample)
	 ...
</DATA>
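
For reference, |D'| between the disease mutation and one biallelic marker can be computed with the standard definition: D = p_AB - p_A * p_B, divided by the largest value D could take given the allele frequencies. The sketch below assumes marker alleles coded 1 and 2 and a 0/1 flag for disease chromosomes. It illustrates the measure and is not taken from GeneArtisan's source; the function name and data are hypothetical.

```python
def abs_d_prime(disease, marker, marker_allele=2):
    """|D'| between disease status and one marker.

    disease: 1 for chromosomes carrying the disease mutation, else 0.
    marker:  marker allele carried by the same chromosomes.
    """
    n = len(disease)
    p_a = sum(1 for d in disease if d) / n
    p_b = sum(1 for m in marker if m == marker_allele) / n
    p_ab = sum(1 for d, m in zip(disease, marker) if d and m == marker_allele) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return float("nan")  # undefined when either locus is monomorphic
    return abs(d) / d_max

# Hypothetical sample: 3 disease chromosomes followed by 5 normal chromosomes.
disease = [1, 1, 1, 0, 0, 0, 0, 0]
marker = [2, 2, 1, 1, 1, 1, 2, 1]
print(round(abs_d_prime(disease, marker), 6))  # 0.466667
```

A value of 1 means that at least one of the four possible haplotypes is absent from the sample, which is why many entries in the example file equal 1.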



An input example



Command line program


The input parameters are the same as those in the GUI. An input example (named input_ex) is included. To run the program using the example:
./GeneArtisan < input_ex
It is suggested to run the program interactively first, because some parameter choices cause additional parameters to be requested.

References


1. Kong A, et al. A high-resolution recombination map of the human genome. Nat Genet 31, 241-247 (2002).

2. Ramser, et al. Initial sequencing and analysis of the human genome. Nature 409, 934-941 (2001). The Human Genome Project.