Maximum Likelihood Estimation of Allele Ages


Documentation for program bdmc21

Program for Monte Carlo Likelihood Estimation of Allele Ages

Bruce Rannala, University of California at Berkeley

version 2.1, October 7, 1998

Purpose:

Calculates a maximum likelihood estimate of the time (t1) at which a unique non-recurrent mutation first arose in a population using the observed pattern of mutations at markers that are completely linked to the mutation.

Obtaining the bdmc21 program:

The program is currently available for the Windows NT/98/95 (bdmc21.exe) and the Linux-Intel (bdmc.21.tar.Z) operating systems. The program is available from our web page at www.rannala.org.

Changes since release 2.0:

A serious bug was found (detected by Giorgio Bertorelle, thanks Giorgio) in release 2.0 that produces unreliable likelihoods when the joint probability of the number of mutant copies and the intraallelic variability are used. Please rerun any earlier analyses using release 2.1 which has fixed this bug.

The program parameters and data are described in the following paper: Slatkin, M., and R. Rannala. 1997. Estimating the Age of Alleles by Use of Intraallelic Variability. American Journal of Human Genetics 60:447-458.
Brief descriptions of the parameters, data, and options are given below.

Program parameters:

1. growth rate and selection coefficient (xi):

this should range between -1 and 1. Note that a negative value indicates population decline and (or) negative selection and a positive value indicates population growth and (or) positive selection.

2. sample fraction (f):

this is the fraction of chromosomes in a population that are sampled f=n/2N, where n is the total number of chromosomes sampled and N is the (diploid) population size.

3. mutation rate (mu):

this is the total mutation rate for the linked markers. Typical values for microsatellites are 10^-3 to 10^-4.

Program data:

1. mutant copies (i):

this is the total number of chromosomes carrying the non-recurrent mutation that occur in the sample.

2. segregating sites (S):

this is the total number of mutations that have occurred at the linked markers since the non-recurrent mutation arose.

Program options:

1. conditional on number of copies? (y,n):

specifying y (yes) causes the likelihood to be determined as a function of the observed number of copies of the non-recurrent mutation in a sample of chromosomes, as well as the number of mutations at linked markers. Specifying n (no) causes the likelihood to be determined as a function only of the number of mutations at linked markers.

2. Monte Carlo replicates (R):

this is the number of replicates for the Monte Carlo integration procedure used to estimate each likelihood by integrating over the coalescence times (R between 1000 and 10,000 for most datasets; the smoothness of the likelihood surface should indicate whether this number is large enough; the estimate of the likelihood is unbiased so errors generate random noise about the points of the curve).

3. initial t1, final t1, and increments:

the program is designed to generate a set of points for plotting the log likelihood curve. The points will be generated from initial_t1*increment to final_t1*increment at intervals of length increment. For example initial_t1=1, final_t1=5, and increments=10 will generate the likelihoods for t1 taking the following values:

10,20,30,40,50

while initial_t1=5 final_t1=20 and increments=5 will generate likelihoods for t1 taking the following values:

25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100

Program outputs:

The program output is placed in an output file that you specify and is in the Mathematica matrix format. I used ListPlot in Mathematica to generate my log likelihood curves by reading in the data directy from the output files. You may also cut and paste. In case you are unfamiliar with Mathematica matrices the output is in the form

{{t1_1,lik_1},{t1_2,lik_2},...,{t1_k,lik_k}}

where t1_i is the value of t1 for the ith increment, lik_i is the log likelihood calculated for t1_i, and there are k increments in total.

Running the program:

The program may be run (in UNIX) with command line options as follows:

bdmc21 infile outfile seed

where infile is the name of an input file, outfile is the name of an output file to be created, and seed is a positive integer to initialize the random number generator. If no command line arguments are given (or you are running the Windows version) you will be prompted for these inputs.

Input file format:

The program is completely unforgiving of variations in input file format, so please follow this format in all aspects (including spacing). Here is an example formatted input file using data from our paper (test.inp):

PARAMETERS

growth+selection(xi)= 0.005

sample_fraction(f)= 0.00014

mutation_rate(mu)= 0.0001

DATA

mutant_copies(i)= 1705

segregating_sites(S)= 46

OPTIONS

conditional_on_copies?(y,n)= y

Monte_Carlo_replicates(R)= 5000

initial_t1= 5

final_t1= 25

increments= 10