LDMAP

VERSION 1.0, NOVEMBER 2004.

Andrew Collins, Genetic Epidemiology and Bioinformatics Research Group, University of Southampton.

LDMAP is a program for constructing linkage disequilibrium (LD) maps. LD maps are scaled in linkage disequilibrium units (LDUs) and, when plotted against the physical map, show a pattern of plateaus (reflecting regions of low haplotype diversity, or 'LD blocks') and steps (which represent recombination hot-spots or recombination events). The LD map has a number of compelling advantages for disease mapping over both the sequence-based physical map and the often low-resolution genetic linkage map. SNP markers should be spaced evenly on the LD map rather than on the physical map, and one LD unit is equivalent to one swept radius (representing the extent of useful LD, see below). LD maps can be constructed automatically from both haplotype and genotype (diplotype) data.

Different populations may show the same PATTERN of LD even if the overall LENGTH of the LD map differs (reflecting different population histories). This appears to be true because recombination, and the time over which recombination events have taken place, are the major determinants of the LDU pattern. Recent evidence suggests a close relationship between the pattern of LD and recombination. The LD map may therefore be useful for increasing the resolution of the linkage map, which will permit more detailed studies of the sequence determinants of recombination pattern.

References.

LDMAP extends a body of work in this general area going back several years. Readers should refer to the papers listed for a more detailed explanation of key terms. The main reference for LDMAP is Maniatis et al (2002), given in the following list, which also includes other relevant papers: references.

The software.

LDMAP is written in C for a SUN workstation; some modification may be required for use under Linux. The program is under active development, and the version supplied here may not be the latest. Users should contact Andy Collins at arc@soton.ac.uk for updated information. The files supplied may be extracted using:

uncompress ldmap.tar.Z

tar xvf ldmap.tar

The executable program is controlled by a simple shell script called ldmap. If the command './ldmap' is given, the script will run and give the following output: script. I will review, with examples, the process of constructing an intermediate file from either haplotype or diplotype data (options 1 and 2) and of using this file to construct LD maps (options 3 and 4).

Haplotypes and diplotypes with SNPs - construct intermediate file (OPTIONS 1 and 2).

DATA FILE

An example data file with haplotypes is given here: CF data and an example file with diplotypes is: diplotype data. I will describe the haplotype example in the most detail, but the comments largely apply to both types of data. All LDMAP haplotype/genotype data files have three sections delimited by dashed lines. The first section usually contains a reference, as here to the Kerem et al., Science paper. The second section defines the columns given in the data section. In this example there are 23 marker loci (R1...R23) and a count N giving the number of haplotypes of that particular type. Each line of the third section then contains either the haplotype data for one individual or a haplotype together with the number of examples of that haplotype found in the sample. Column descriptions are given in brackets: each locus name is followed by a comma, the column position in the file for the allele(s) (e.g. 1-4), and a further comma followed by a location. Typically the location is given in kilobases (kb). Note that these are disease case and control haplotypes, but we are ignoring phenotype for this demonstration of LD map construction. Alleles are coded 1 or 2, and missing data is indicated by a blank or zero. Note that column positions must be specified in the order in which they appear in the file, but map locations need not be in order.

JOB FILE

The job file defines the models to be tested by the program. A very simple example is given here; it estimates parameters M and E in the Malecot model. A job file contains a PA 'control' followed by one or more IT controls. PA stands for parameter and defines the starting values of the parameters in the model. The IT control defines which parameters are to be iterated. The three Malecot parameters (M, E and L) describe the pattern of linkage disequilibrium in the region (see references). Multiple PA and IT controls may be placed in the job file. For constructing LD maps, the third Malecot parameter (L) is best left out of this file, so that the program uses the predicted L (computed internally) as defined by Morton et al (2001). CC always terminates a job file. The job file follows here: JOB FILE.
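As an illustration of what the three parameters do, the Malecot model described in the references predicts the association rho between two markers separated by distance d as rho = (1 - L) M exp(-E d) + L. The sketch below is not part of LDMAP: the function name and the parameter values are assumptions chosen only to show the shape of the curve.

```python
import math

def malecot_rho(d, M, E, L=0.0):
    """Expected association rho at distance d under the Malecot model.

    M : intercept parameter (rho at d = 0 is (1 - L) * M + L)
    E : rate of exponential decline of rho with distance (epsilon)
    L : asymptote, the residual association at large distance
    """
    return (1.0 - L) * M * math.exp(-E * d) + L

# Illustrative values only (NOT estimates from the CF data).
M, E, L = 0.9, 2.0, 0.1
for d in (0.0, 0.25, 1.0 / E, 5.0):
    print(f"d = {d:5.2f}   rho = {malecot_rho(d, M, E, L):.4f}")
```

At d = 1/E, which is one swept radius, the distance-dependent part of rho has fallen to 1/e of its value at d = 0. At large d, rho approaches L.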

INTERMEDIATE (output) FILE

The main output at this point is the 'intermediate' file: INTERMEDIATE FILE. These files always have the same basic format whether derived from haplotype or diplotype data: diplotype int file. The Malecot model may be fitted directly to intermediate file data (still under 'job' file control). The fields in the file are shown below: the names of the two loci in each pair, their map locations, the association rho and its information (Krho), chi square, sample size n, allele frequencies Q and R, and the disequilibrium D.

Column definitions:

locus1          locus2           location1  location2          rho       Krho    chiSq     n            Q            R            D
R1              R2                   0.000      0.009 0.9065040650     22.188    18.23   184 0.1304347826 0.5543478261 0.0526937618
R1              R3                   0.000      0.024 0.8778778779     98.660    76.03   183 0.4043715847 0.5573770492 0.1571262206
R1              R4                   0.000      0.524 0.1525498891     64.631     1.50   182 0.3021978022 0.5494505495 0.0207704384
R1              R5                   0.000      0.534 0.1676829268     66.329     1.87   182 0.3076923077 0.5494505495 0.0232459848
R1              R6                   0.000      0.554 0.3657575758    126.283    16.89   182 0.3626373626 0.4505494505 0.0728776718
R1              R7                   0.000      0.569 0.3576470588    132.392    16.93   182 0.3736263736 0.4505494505 0.0734210844
R1              R8                   0.000      0.594 0.5492063492     52.534    15.85   142 0.3169014085 0.5563380282 0.0772168221
R1              R9                   0.000      0.614 0.2819512195    101.781     8.09   184 0.4076086957 0.5543478261 0.0512169187
R1              R10                  0.000      0.619 0.3006018372    106.448     9.62   184 0.4184782609 0.5543478261 0.0560609641
R1              R11                  0.000      0.654 0.2913992298    104.093     8.84   184 0.4130434783 0.5543478261 0.0536389414
R1              R12                  0.000      0.684 0.2550779405     69.977     4.55   166 0.3493975904 0.5602409639 0.0391929162
R1              R13                  0.000      0.709 0.3212022081     88.184     9.10   166 0.4036144578 0.5602409639 0.0570111772
R1              R14                  0.000      0.744 0.2550779405     69.977     4.55   166 0.3493975904 0.5602409639 0.0391929162
R1              R15                  0.000      0.779 0.1862567812     63.584     2.21   180 0.3111111111 0.5611111111 0.0254320988
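The columns of the intermediate file are related to one another. The rows shown above are numerically consistent with rho = D / (Q (1 - R)) and chiSq = rho^2 x Krho. These two relations are inferred here from the example rows and from the definition of rho in the cited papers; they are assumptions, not something read from the LDMAP source code. The short sketch below (function names are hypothetical) recomputes rho and chi square for the first three rows.

```python
# First three rows of the intermediate file shown above:
# (rho, Krho, chiSq, Q, R, D)
rows = [
    (0.9065040650, 22.188, 18.23, 0.1304347826, 0.5543478261, 0.0526937618),
    (0.8778778779, 98.660, 76.03, 0.4043715847, 0.5573770492, 0.1571262206),
    (0.1525498891, 64.631, 1.50, 0.3021978022, 0.5494505495, 0.0207704384),
]

def rho_from_frequencies(Q, R, D):
    """Assumed relation rho = D / (Q * (1 - R)); Q <= R in all rows used here."""
    return D / (Q * (1.0 - R))

def chi_square_from_rho(rho, K):
    """Assumed relation chiSq = rho^2 * Krho (Krho is the information about rho)."""
    return rho * rho * K

for rho, K, chi2, Q, R, D in rows:
    print(f"rho: {rho_from_frequencies(Q, R, D):.6f} (file {rho:.6f})   "
          f"chiSq: {chi_square_from_rho(rho, K):.2f} (file {chi2:.2f})")
```

A check of this kind can be a quick way to confirm that an intermediate file has been read with the columns in the right order.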

CONSTRUCTING THE LD MAP FROM THE INTERMEDIATE FILE.

Once constructed from either haplotype or diplotype data, the intermediate file can be used for the construction of an LD map (OPTIONS 3 and 4). Option 3 allows the construction of very large maps in sections, while option 4 assembles the map in one piece. The same (or a different) job file may be used, and the ("analysis") output file containing the final LD map is given here: LDMAP analysis output file. There is also a shorter output ("map") file: LDMAP map output file. As an example, the output from the diplotype analysis is given here: LDMAP (diplotype) analysis output file and LDMAP (diplotype) map output file.

The method for constructing the map involves the estimation of the epsilon parameter (E) in the Malecot equation for each interval (between adjacent SNPs) in the map. Any pair of markers whose interval includes the one being estimated contains some information about epsilon, unless the markers in the pair are so far apart that they carry no useful information about LD. In running option 4 there are yes/no queries asking whether any of the mapping defaults should be changed. For most purposes the response 'no' to all queries will give a satisfactory result.

The output file first gives the result of fitting the physical (kb) map to the intermediate file data; the final -2lnL is 1298.17. During the construction of the LD map the Malecot parameter M is updated at 25, 50, 100, 200, etc. iterations, which improves the approach to convergence. Finally the fit of the pairwise association data to the LD map is obtained, and the -2lnL is now 289.40, a substantial improvement reflecting how well the LD map describes the pattern of LD in the region. Note that epsilon (E) is ~1 in LD maps. The LD map shown here has a length of 6.51 LDUs in a region of 1769 kb. Full details of the method are described in Maniatis et al (2002).
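To make the relationship between interval epsilons and the map concrete: in the method of Maniatis et al (2002), the length of an interval in LDUs is the product of its epsilon and its physical length, and a marker's LD map location is the running sum of these products. The following is a minimal sketch of that accumulation step only, not of LDMAP's estimation procedure; the function name, marker locations and epsilon values are hypothetical.

```python
def ld_map(locations_kb, epsilons):
    """Cumulative LDU location of each marker.

    locations_kb : physical marker locations in ascending order (kb)
    epsilons     : one epsilon per interval between adjacent markers (per kb)
    """
    if len(epsilons) != len(locations_kb) - 1:
        raise ValueError("need one epsilon per interval between adjacent markers")
    ldu = [0.0]
    for i, eps in enumerate(epsilons):
        d = locations_kb[i + 1] - locations_kb[i]  # interval length in kb
        ldu.append(ldu[-1] + eps * d)              # add interval length in LDUs
    return ldu

# Hypothetical markers and epsilon values, NOT taken from the CF analysis.
locations = [0.0, 10.0, 30.0, 35.0, 80.0]
epsilons = [0.001, 0.0005, 0.2, 0.0]
for kb, ldu in zip(locations, ld_map(locations, epsilons)):
    print(f"{kb:7.1f} kb   {ldu:.3f} LDU")
```

In this made-up example the first two intervals and the last one add almost nothing to the map (a plateau, or 'LD block'), while the short 5 kb interval with a large epsilon adds a full LDU (a step).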