ETDIV and ETCLUS

This site contains programs to analyze genetic diversity and
relationships among bacterial strains characterized by multilocus
enzyme electrophoresis (Selander et al. 1986, Appl. Environ. Microbiol.
51:873-884). The programs can also be used with other types of
binary-state or multistate data with unordered categories. Data should
be coded as integers, with 0 (null alleles) treated as missing data.
The input data files must be stored as text files in the format
described below.

ETDIV finds and lists the electrophoretic types (ETs) in a collection
of bacterial isolates with multilocus enzyme profiles. The program
writes the results to an output file and creates a file named
ETLIST.DAT to be used as input for ETCLUS.

The input file for ETDIV must have the following format:

Line 1         title
Line 2         no. of isolates, loci, and populations (3I4)
Line 3         enzyme labels (15A4); continue on the next line if the
               no. of enzymes exceeds 15
Line 4         population labels (15A4). Do not include this line if
               the no. of populations in Line 2 = 1.
Line 5         input FORTRAN format statement for reading enzyme
               profiles
Line 6 and on  isolate labels, population codes, and enzyme profiles,
               read by the format statement in Line 5

Isolate labels can be up to 10 columns in width (A10), and all other
values must be integers. Population codes must be consecutive integers.

For an analysis of 100 isolates, characterized for 6 enzymes, and
sampled from 3 populations, the input file would look something like
this:

Test data title                 (Line 1 - your title)
 100   6   3                    (Line 2 - 3 numbers, 4 columns each)
PGI IDH ACO G3P PE2 MDH         (Line 3 - enzyme symbols)
SP1 SP2 SP3                     (Line 4 - population labels)
(A10,I2,6I3)                    (Line 5 - format statement)
isolate1   1  5  3  3  2  2  5  (Line 6 - label, population code,
isolate2   2  5  4  4  2  2  4   enzyme profile - all integers)
...
and so forth for a total of 100 isolates

For analysis as a single population, the input file would be:

Test data title                 (Line 1 - your title)
 100   6   1                    (Line 2 - 3 numbers, 4 columns each)
PGI IDH ACO G3P PE2 MDH         (Line 3 - enzyme symbols)
(A10,2X,6I3)                    (Line 4 - format statement, skip code)
isolate1   1  5  3  3  2  2  5  (Line 5 - label and enzyme profile)
isolate2   2  5  4  4  2  2  4
...
and so forth for a total of 100 isolates

The format statement (Line 4) tells the program how to read the columns
of a single line of data. It is a leftover from the old days of FORTRAN
programming. In this case, the format line specifies the first 10
columns as an alphanumeric variable (the strain label), skips 2
columns, and then reads 6 integer variables (one for each enzyme
locus), each 3 columns in width.

See the example data file TEST.DAT. An annotated output file is
supplied in ETDIV.OUT.

To execute, type ETDIV and respond to queries.

Maximum parameters for ETDIV are:

max. no. of isolates           1000
max. no. of enzyme loci          40
max. no. of populations          12
max. no. of ETs                 200
max. no. of alleles per locus    30

ETDIV has a special capacity for null alleles. Null alleles occur when
there is no detectable enzyme activity at a locus. If nulls are scored
as '0', ETDIV pools isolates whose allele profiles differ only at
nulls. The program first defines all ETs, including nulls, and then
pools the ETs that differ only by nulls. When there is a tie (i.e., an
ET differs by a null from more than one other ET), it pools with the ET
that has the most isolates.

If you want nulls to be treated the same as other alleles, simply code
the null alleles as an integer other than 0 (for example, 99). However,
in this case you are assuming, as with other electromorphs, that
representatives of an allele are identical by descent.

ETCLUS uses the output file ETLIST.DAT created by ETDIV and finds a
dendrogram based on the average linkage algorithm.
Distance is measured as the proportion of mismatched loci between pairs
of ETs. Null alleles that are scored as '0' are not used in the
calculation of pairwise distances.

To execute, type ETCLUS and respond to queries. For the input file,
type ETLIST.DAT or the name of a file with a format similar to
ETLIST.DAT.

Maximum parameters for ETCLUS are:

max. no. of isolates (or ETs)   100
max. no. of enzyme loci          40

OTHER PROGRAMS

ETJOIN finds a single neighbor-joining (NJ) tree and estimates branch
lengths from a distance matrix based on the method of Saitou and Nei
(1987, Mol. Biol. Evol. 4:406-425) and Studier and Keppler (1988, Mol.
Biol. Evol. 5:729-731). ETJOIN uses the same input file format and has
the same default parameter values as ETCLUS. To execute, type ETJOIN
and respond to queries. I do not use this program anymore. Instead I
use ETMEGA and MEGA (Kumar, Tamura, and Nei, 1994, CABIOS 10:189).

ETMEGA creates a distance matrix for input into the MEGA program. This
program uses the same input file format and has the same default
parameter values as ETCLUS. It calculates genetic distance between
pairs of ETs and writes a file in the MEGA input format. Note that MEGA
does not accept blank spaces within the strain labels, so replace these
blank spaces with some other symbol. The output file from ETMEGA is
then used as input to MEGA by selecting the distance matrix choice
under MEGA's data file input choice.

ETBOOT is a bootstrap program that randomly selects loci, obtains a
distance matrix, finds a tree (based on average linkage or neighbor
joining), and records the nodes of the tree. The process is repeated
for a number of bootstrapped trees (input by the user). ETBOOT then
tabulates the number and frequency of each observed node recovered
among the randomly generated trees.
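
The mismatch distance and the ETBOOT resampling loop described above
can be sketched in a few lines of Python. This is only an illustration
and not the actual FORTRAN code. All function names are hypothetical,
and several details are assumptions: a locus is dropped from a
comparison if either ET has a null (0) there, ETs with no comparable
loci get distance 0, loci are resampled with replacement, and ties
between equally close clusters are broken by cluster order.

```python
import random
from collections import Counter


def mismatch_distance(a, b):
    """Proportion of mismatched loci between two ET profiles.

    Loci where either profile is 0 (null) are skipped (assumption).
    """
    compared = 0
    mismatched = 0
    for x, y in zip(a, b):
        if x == 0 or y == 0:
            continue
        compared += 1
        if x != y:
            mismatched += 1
    # Assumption: ETs with no comparable loci are given distance 0.
    return mismatched / compared if compared else 0.0


def average_linkage_nodes(profiles):
    """Cluster ETs by average linkage; return each node as a frozenset
    of the ET indices it contains."""
    n = len(profiles)
    d = {(i, j): mismatch_distance(profiles[i], profiles[j])
         for i in range(n) for j in range(n)}

    def avg(c1, c2):
        # Mean distance over all pairs of members of the two clusters.
        return sum(d[i, j] for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [frozenset([i]) for i in range(n)]
    nodes = set()
    while len(clusters) > 1:
        # Pick the closest pair; ties fall to the lowest indices.
        _, i, j = min((avg(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        nodes.add(merged)
    return nodes


def bootstrap_node_counts(profiles, n_reps, seed=0):
    """Resample loci with replacement n_reps times and count how often
    each node appears among the resulting trees."""
    rng = random.Random(seed)
    n_loci = len(profiles[0])
    counts = Counter()
    for _ in range(n_reps):
        cols = [rng.randrange(n_loci) for _ in range(n_loci)]
        resampled = [[p[c] for c in cols] for p in profiles]
        counts.update(average_linkage_nodes(resampled))
    return counts


if __name__ == "__main__":
    # Hypothetical ET profiles (6 loci each); 0 marks a null allele.
    ets = [[5, 3, 3, 2, 2, 5],
           [5, 4, 4, 2, 2, 4],
           [5, 3, 0, 2, 2, 5],
           [1, 4, 4, 6, 2, 4]]
    reps = 100
    for node, count in bootstrap_node_counts(ets, reps).most_common():
        print(sorted(node), count, count / reps)
```

Each node is identified by the set of ETs below it, so the same
grouping is counted as the same node no matter where it sits in a
given bootstrapped tree.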

ETLINK calculates several measures of linkage disequilibrium, including
the distribution of the standardized coefficient (D') between all pairs
of alleles, the two-locus coefficient Q* for multiple alleles per
locus, and the indices of multilocus association based on the
properties of the mismatch distribution. For information and references
about these measures, see Whittam et al. (1983, Proc. Natl. Acad. Sci.
USA 80:1751-1755) and Hedrick and Thomson (1986, Genetics 112:135-156).

To execute, type ETLINK and respond to queries. For input, this program
uses the standard data file in the same format as that used by ETDIV.
The ETLIST.DAT file can also be used as input. In this latter case, the
coefficients are calculated with ETs as the sampling unit. Estimates of
Q* for all pairwise comparisons between loci are written to a separate
file called QSTAR.OUT. An example output file with annotations is
listed in ETLINK.OUT.

Maximum parameters for ETLINK are:

max. no. of isolates (or ETs)   400
max. no. of loci                 40
max. no. of alleles per locus    30

If you have questions or problems, please contact me.

Thomas S. Whittam
Institute of Molecular Evolutionary Genetics
Pennsylvania State University
University Park, PA 16802
(814)-863-1970
tsw1@psuvm.psu.edu