Primary Contact(s): Bob Thomson
Created: 1 March 2007
Required Software: Structure (command line version), Distruct
Example Datafile: marm_struct.input
This tutorial outlines the basics of running Jonathan Pritchard's Structure and Noah Rosenberg's Distruct software packages. It uses a small example dataset containing SNP data for Western Pond Turtles (Emys marmorata).
Structure is a versatile tool that can be used in several ways. This tutorial shows how the command line version of Structure can be used with the sister program Distruct to investigate population structure in a group of California turtles. Although we advise anyone interested in Structure to learn the command line version of the software, we provide supplementary notes about installing and using the Structure graphical front end as a footnote.
Structure works by grouping individuals into clusters such that Hardy-Weinberg equilibrium is maximized within clusters. It can use most commonly employed genetic markers as data, including AFLPs, SNPs, microsatellites, sequence haplotypes, HapSTRs, SNPSTRs, etc. The program assumes a model of K populations, each of which is characterized by a set of allele frequencies at each locus, and groups individuals into the K populations in such a way as to maximize HW equilibrium within populations. By varying the K parameter across several runs of the program and inspecting the resulting probabilities under different values, we can infer a likely value of K which best captures the variation present in the data. This program is often used in conjunction with a second program, Distruct, which makes it easy to visualize the results. A final note before proceeding: a graphical version of the program exists for Windows and various flavors of Unix, but we're going to use the command line version here because it's more general, easier to customize, and easy to batch.
The data set we are using consists of 6 SNP loci (4 nuclear, 2 mitochondrial) for 43 individuals of the western pond turtle (Emys marmorata). These turtles fall out into 5 well-supported mitochondrial DNA clades, which are not well supported on a gene-by-gene basis at the nuclear level. You'd want more loci in a real analysis, but I've kept the dataset small here to allow for reasonable run times. The instructions ought to work for both OS X and Windows, though the Unix commands may not work for Windows users. I've tried to include instructions for both operating systems where it's critical.
Start by going into the folder containing the datafiles for structure...
cd ~/desktop/bodega_structure/datafiles/structure (or wherever you've stored your copy)
and list the contents...
ls (or dir for you non-unix folks)
You'll see three files: two parameter files for Structure (mainparams and extraparams) and a datafile (marm_struct.input). The settings for each run are controlled by these two parameter files, which Structure always looks for upon execution. It will also look for an input data file, the name of which you specify in the mainparams file.
open -e marm_struct.input (or just open it in WordPad on Windows)
This is the basic Structure input file format. Eight columns correspond respectively to the name of the individual, which mtDNA clade each individual belongs to, and six SNP loci. Each SNP locus is coded using 1's and 0's (for the alternative alleles), and -9's (for missing data). Each diploid individual takes up two rows in the data file, which allows you to code homozygous and heterozygous positions, e.g.:
3259_Mader 1 1 1 0 0 -9 0
3259_Mader 1 1 1 0 1 -9 0
So individual 3259_Mader is in mtDNA clade 1, is homozygous for SNP loci 1,2,3, and 6; is heterozygous for locus 4; and is missing data for locus 5. Easy!
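To make the row-pairing concrete, here is a small Python sketch (not part of Structure; the helper name is made up for illustration) that pairs an individual's two data rows and labels each locus:

```python
MISSING = -9  # matches the MISSING value used in the data file

def classify_genotypes(row1, row2):
    """Pair the two data-file rows for one diploid individual and
    label each SNP locus as homozygous, heterozygous, or missing."""
    calls = []
    for a, b in zip(row1, row2):
        if a == MISSING or b == MISSING:
            calls.append("missing")
        elif a == b:
            calls.append("homozygous")
        else:
            calls.append("heterozygous")
    return calls

# The six genotype columns from 3259_Mader's two rows
# (label and mtDNA clade columns dropped):
print(classify_genotypes([1, 1, 0, 0, -9, 0], [1, 1, 0, 1, -9, 0]))
# -> ['homozygous', 'homozygous', 'homozygous', 'heterozygous',
#     'missing', 'homozygous']
```

The output matches the description above: homozygous at loci 1, 2, 3, and 6; heterozygous at locus 4; missing at locus 5.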
Now on to setting up Structure's two parameter files.
Main Parameter File
This file contains the main parameters that you'll need to set in order to run the program. Each parameter looks something like:
#define MISSING -9 // (int) value given to missing genotype data
This ought to look familiar to those of you with experience programming in C: the define statement tells Structure that the value for MISSING data is -9. If you code your missing data some other way, simply change the number to match your coding scheme. Notice that each parameter takes a certain type of data; the options are integer (int), Boolean (B), or string (str). This means that you can't code your missing data with '?'; you can only use integers.
On to the parameters themselves...
Most are self-explanatory. INFILE and OUTFILE correspond to the name of your input datafile and the name of the file you want the output written to, respectively. Note that the program will overwrite the output file if it already exists; this is important to remember when you are doing multiple runs at different values of K.
The next 14 parameters refer to the format of the data file and tell the program what, precisely, we have in there. NUMINDS and NUMLOCI refer to the numbers of individuals and loci in your input data file; for our case, these should be set to 43 and 6, respectively. LABEL wants to know if we have labeled our individuals in the first column of the input file. We have, so it should be set to 1. Likewise, we've labeled individuals according to mtDNA clades, so POPDATA should be set to 1. Note that we have the option of using this population data when inferring K and clustering individuals during the Structure run. Here we are interested in letting the genetic data tell us what K is, so we won't use the population information. If you did want to use it, you would set the USEPOPINFO parameter in the extraparams file to 1, or you could flag individuals in the input file itself, which is what the POPFLAG parameter refers to. We didn't use this flag, so set it to 0.
PHENOTYPE, EXTRACOLS, PHASEINFO, and MARKOVPHASE all refer to additional data that you can include in your data file for various reasons. We haven't included any of these here, so they should all be 0.
MISSING and PLOIDY refer to exactly what they sound like: our missing data is coded as -9, and turtles are diploid. The last three formatting parameters refer to more file format options and data that can be included; set these to 0 for this exercise.
Three final parameters in this file control the actual program run. You need to tell Structure which value of K to assume using the MAXPOPS parameter. We want to infer the best value of K from the data, so we'll run the program at several values and go back and compare the results (more on this below). Our mtDNA phylogeny suggests 5 clades, so we'll run the program with MAXPOPS set at 1 through 5; start with MAXPOPS equal to 2. Finally, we need to set the BURNIN length and the number of reps (NUMREPS) to do after the burnin. We're purposely using a simple dataset here so that we can run the program fast and complete this exercise in a reasonable amount of time; for us, these values can be set to 1000 and 100000, respectively. For more complex (i.e., realistic) datasets you'd want each of these to be an order of magnitude or so higher in order to make sure that the run converges. After the run finishes we'll check to make sure that we've run the Markov chain long enough. Save and close this file.
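Putting the settings above together, the relevant lines of mainparams for this exercise would look something like the following sketch (not a complete file; the comments and ordering in your copy may differ slightly):

```text
#define INFILE  marm_struct.input   // (str) name of input data file
#define OUTFILE marm_struct.output  // (str) name of output data file

#define NUMINDS   43   // (int) number of diploid individuals in data file
#define NUMLOCI   6    // (int) number of loci in data file
#define LABEL     1    // (B) input file contains individual labels
#define POPDATA   1    // (B) input file contains a population identifier
#define POPFLAG   0    // (B) input file contains a flag column
#define PHENOTYPE 0    // (B) input file contains phenotype information
#define EXTRACOLS 0    // (int) number of additional columns of data
#define MISSING   -9   // (int) value given to missing genotype data
#define PLOIDY    2    // (int) ploidy of data

#define MAXPOPS   2      // (int) number of populations assumed (K)
#define BURNIN    1000   // (int) length of burnin period
#define NUMREPS   100000 // (int) number of MCMC reps after burnin
```

Note that Structure appends "_f" to the output filename it writes, so an OUTFILE of marm_struct.output produces a results file named marm_struct.output_f.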
Extra Parameter File
This file is set up similarly to the mainparams file. It contains parameters controlling the model of population genetic structure that Structure uses, output options, priors, and a few other settings. We won't go through them all, as you'll rarely need to change most of them; if you're still curious, the program's documentation covers them in detail.
One parameter to be concerned with here is FREQSCORR. Setting FREQSCORR to 1 causes Structure to use a model in which the allele frequencies among closely related populations can be correlated. This can be useful in trying to differentiate between closely related populations and in other 'hard' structure problems. It is prone to overestimating K, however, so it should be used with care, particularly if you are interested in species boundaries.
Enough parameters, let's run the program already...
Structure automatically looks for the mainparams and extraparams files, as well as whatever input file you tell it about (in the mainparams file), when you run the program. So just make sure these three files are in the same folder as the Structure executable. If you're using OS X, execute the program by typing ./structureOSX; on a PC, just type structure.exe. You'll see a lot of output on the screen, and when the run finishes it will print the results to the file you've specified.
Before going any further, check the parameter estimates printed by the program throughout the run. You want to make sure that the estimates of alpha, D, and ln Likelihood have converged before the end of the burnin period is reached. It is also advisable to repeat runs several times to ensure that you get consistent answers. The run lengths I've recommended appear to be adequate for this dataset, but you would need to worry about this more with a real dataset.
If the parameter estimates look like they've converged, go ahead and open your output file.
open -e marm_struct.output_f
This file has some information about the settings for the run, a table that shows the proportion of each predefined population assigned to each cluster, a table of likelihoods, a table of the proportion of each individual's alleles assigned to each cluster, and some other information. Compare your estimate for the ln probability of the data with your neighbors' before moving on; it may not be exactly the same for everyone, but it ought to be close.
Now that we've got output from Structure, let's make one of those pretty figures that everyone's putting in their papers these days. This is done by giving some of the output from Structure to the separate program Distruct, which uses this data to produce a figure showing the proportion of each individual's ancestry inferred to be from each cluster, as well as how this relates to your predefined populations. You can glean the same information by staring at the tables in the Structure output file, but that's no fun and this way is much easier.
Distruct operates much like Structure in that it looks for a certain set of files when it runs. It has a single parameter file and needs two data files; it will also optionally take a few more files that tell it how to label populations, which colors to use, etc.
The two datafiles are simply the individual and population tables from the Structure output file. Simply copy and paste these two tables into their own plain text files. You only want the data itself, NOT the labels and formatting that Structure puts around it. Your population file should contain text that looks like the following:
1: 0.109 0.891 23
2: 0.075 0.925 2
3: 0.987 0.013 5
4: 0.969 0.031 10
5: 0.883 0.117 3
and your individual file should look like this (but longer; this is only the top of the file):
1 0001_Merce (0) 4 : 0.970 0.030 (0.800,1.000) (0.000,0.200)
2 0002_Merce (0) 4 : 0.955 0.045 (0.694,1.000) (0.000,0.306)
3 0227_Kern (0) 4 : 0.970 0.030 (0.802,1.000) (0.000,0.198)
4 0331_San_D (0) 3 : 0.990 0.010 (0.941,1.000) (0.000,0.059)
5 0669_XX (0) 1 : 0.304 0.696 (0.000,0.814) (0.186,1.000)
Distruct Parameter File
Once you have each data file made, open Distruct's parameter file, which is called drawparams. Input the names of the two datafiles into the first two parameters. The next two parameters want the names of files containing labels for your predefined populations. Our predefined populations are labeled 1 through 5, but we can have Distruct translate these numbers into geographic locations. So just make a file containing the text:
1 North
2 San Bernardino
3 South
4 San Joaquin
5 Santa Barbara
save it, and input the filename into the INFILE_LABEL_BELOW parameter. We won't use any labels on top, so skip the INFILE_LABEL_ATOP parameter. The INFILE_CLUST_PERM parameter takes a file which specifies the order in which to print the clusters as well as which colors to use. This file looks like:
1 red
2 green
3 yellow
4 blue
5 orange
The order of numbers controls the order in which the clusters are printed, and the color labels tell the program which color to use. Finally, give the program the name of a file to output to.
The next section of the drawparams file wants the value of K used; set this to match whichever value of MAXPOPS you set back in the Structure run. NUMPOPS is 5 (the number of mtDNA clades), and NUMINDS is 43. The remaining parameters are all printing options and are self-explanatory. Most can be left at the defaults, but make sure that you set PRINT_LABEL_ATOP to 0 and PRINT_LABEL_BELOW to 1, because this is how we've set up our label files.
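Pulled together, the top of drawparams for this run might look something like the sketch below. The parameter names follow the Distruct documentation, but the filenames here are just examples; substitute whatever you named your pasted tables and label files:

```text
#define INFILE_POPQ        marm.popq    // (str) population q-matrix table
#define INFILE_INDIVQ      marm.indivq  // (str) individual q-matrix table
#define INFILE_LABEL_BELOW marm.names   // (str) labels printed below the figure
#define INFILE_CLUST_PERM  marm.perm    // (str) cluster order and colors
#define OUTFILE            marm_K2.ps   // (str) output postscript file

#define K        2   // (int) number of clusters (match MAXPOPS)
#define NUMPOPS  5   // (int) number of predefined populations
#define NUMINDS  43  // (int) number of individuals

#define PRINT_LABEL_ATOP  0  // (B) we aren't using labels above the figure
#define PRINT_LABEL_BELOW 1  // (B) print the clade names below the figure
```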
Now you should be all set to actually run the program.
Again, make sure all the necessary files are in the same folder and type ./distruct_OSX or distruct.exe as the case may be...
You should now have a postscript file named whatever you put into the drawparams file (Figure 1). If you're on a Mac, just open it with Preview and it will automatically convert to a PDF and open. On a PC, you can use whatever postscript program you already have, or download Ghostscript and Ghostview, a free postscript viewing package.
Each vertical bar represents a single individual. The individuals are grouped into our 5 predefined populations. The colors represent the proportion of each individual's loci that are drawn from each of K = 2 clusters. For example, the leftmost individual in the above figure is from the northern mtDNA clade, and about a third of its alleles are drawn from the other cluster. With K=2, Structure lumps the northern mtDNA clade and the San Bernardino/Monterey clade into one cluster and all the other clades into another. The San Bernardino/Monterey clade seems odd; however, this clade is only represented by 2 individuals, and there is some evidence that the San Bernardino population is non-natural. It's possible that the San Bernardino individual is a relatively recent introduction from the Monterey area. This aside, the groupings make good biological sense.
In order to infer the 'best' value of K, we now need to do a run at each value of K that we are considering. Go back to your mainparams file, change the value of K and the output file (so you don't overwrite your results), and repeat the run for K = 1, ..., 5.
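Editing mainparams and rerunning by hand works fine, but the command line version also accepts flags that override individual parameters, which makes batching easy. A Python sketch (the -K and -o flags are described in the Structure documentation; the executable name and output naming scheme here are just examples):

```python
import subprocess

def structure_cmd(k, exe="./structureOSX"):
    """Build the command for one Structure run at a given K, overriding
    MAXPOPS (-K) and OUTFILE (-o) from the mainparams file."""
    return [exe, "-K", str(k), "-o", f"marm_struct.output_K{k}"]

for k in range(1, 6):
    cmd = structure_cmd(k)
    print(" ".join(cmd))
    # Uncomment to actually launch each run in turn:
    # subprocess.run(cmd, check=True)
```

On Windows, pass exe="structure.exe" instead. Since each run writes to its own output file, nothing gets overwritten.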
There is no set way to choose the value of K which best fits the data, but there are several informal guides that can be used. You should always be careful in considering the biological reality of the 'best' value of K, because Structure can be prone to overestimating K given even mild departures from the underlying model. When you're careful, however, it appears to work well in practice. The goal should be to choose the lowest value of K which effectively captures the major structure present in the data. One way to do this is to compare the probabilities of the data given K for each value of K that you tried. It is often the case that the probability will be low at values of K less than the appropriate value, then plateau at and above the appropriate value. For example, in one set of my runs I got:
K   ln P(X|K)
1   -276.2
2   -186.7
3   -155.6
4   -155.0
5   -158.3
As you can see, ln P(X|K) plateaus around -155 at K=3. Alternatively, you can make this comparison by computing posterior probabilities using a uniform prior on K. I won't go through this here; the calculation is straightforward and is laid out in detail in the Structure documentation. In practice, one tends to see the same pattern as for ln P(X|K), with very low P(K) at values less than the appropriate value, and then a plateau.
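The calculation amounts to normalizing exp(ln P(X|K)) across the values of K you ran; subtracting the largest ln value before exponentiating keeps the arithmetic from underflowing. A Python sketch using the example run values above:

```python
import math

# ln P(X|K) from one set of runs; your numbers will differ slightly
lnp = {1: -276.2, 2: -186.7, 3: -155.6, 4: -155.0, 5: -158.3}

def posterior_k(lnp):
    """Posterior P(K|X) under a uniform prior on K; subtract the max
    ln value before exponentiating for numerical stability."""
    m = max(lnp.values())
    weights = {k: math.exp(v - m) for k, v in lnp.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

for k, p in sorted(posterior_k(lnp).items()):
    print(f"K={k}: {p:.3f}")
# Nearly all of the posterior falls on K=3 (~0.35) and K=4 (~0.63),
# illustrating both the plateau and Structure's tendency to favor
# the larger K discussed below.
```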
So inspection of the probabilities suggests that K is around 3 or 4. Now let's look at the Distruct outputs and see if this is biologically realistic. We've already seen the breakdown for K=2; K=3 and K=4 are printed below.
Figure 2 - Example output from distruct with K set to 3.
Figure 3 - Example output from distruct with K set to 4.
For K=3, Structure cleanly splits the Southern CA population from San Joaquin, while the Santa Barbara population appears to be drawn from each of these clusters; the northern population remains relatively unchanged. So K=3 is a better fit statistically, and it appears to capture more structure in the data than K=2. When we move to K=4, little changes except that the Northern/San Bernardino/Monterey cluster splits into two clusters, with each individual admixed from both. This occurs because the northern population consists of largely invariant individuals. Because little variation exists, splitting these individuals across more and more populations changes the fit to Hardy-Weinberg equilibrium very little, so the program chooses this as the least costly place to introduce an additional cluster. Clearly this doesn't make biological sense, so in this case I'd say K=3 provides the best fit to the data.
With more complex datasets, you may find that it's difficult to decide between two similar values of K. In these cases it can be useful to pull out the ambiguous populations and analyze them separately. This reduces the problem to a K=1 vs K=2 problem, which is generally easier to decide between than say K=15 vs. K=16.
These are the basic steps involved in using Structure and Distruct. It is also possible to use these tools to assign individuals to populations and to do association tests (by using locality and phenotype data in the clustering process, respectively). The keys to remember are to always be cautious in the biological interpretation of K and to carefully consider how you set up the parameters of each run; without careful thought, odd conclusions are likely. Even in the simple dataset we've used here, straightforward computation of the posterior on K would lead you to conclude that K=4 is more likely than K=3, a result that makes little biological sense. In larger, more complex datasets, this problem can become more difficult to overcome. Additionally, with large datasets it becomes increasingly critical to check that your burnin is long enough and to perform multiple runs to make sure that the chain has converged. As you might guess, the run time for the program increases with the size of the dataset and with the size of K, so complex datasets can require seriously long run times.
Downloading and Installing the Structure 2.2 Graphical Front End
Pritchard et al. have developed a graphical front end for Structure, however the installation instructions are brief and have caused some students trouble. Users who are just becoming familiar with the basics of file architecture may benefit from the supplementary instructions below, which are particularly intended to help those running MacOSX on an Intel processor.
1. Download the compressed installation file to your desktop and decompress this file by double-clicking it (if your OS hasn't done this automatically). Place this folder in Mac OS X's Applications folder.
2. Open the MacOSX Terminal application and navigate to the folder you just unzipped. If you haven't changed the default terminal preferences (which should start you off in the home directory of the default user) and have absolutely no experience navigating around folders in the Terminal you should be able to do this by typing "cd ../../Applications/structure2.2.3_install" at the prompt.
3. To install the package, all you need to do is to use the Terminal to run the install script included in the unzipped folder. You may find that simply typing "./install" results in the following error "mkdir: /usr/local/Structure/: Permission denied." If this is the case, you can either change permissions on this folder, or simply install the required information elsewhere by typing the name of the desired location after "./install".
4. Assuming everything worked, you should be able to launch the graphical front-end by typing "./structure" in the Terminal window. Of course, your working directory (i.e., the place you are in the Terminal) must correspond with the location of the structure executable, which should be the folder that you uncompressed earlier.
-Pritchard JK, Wen W. 2002. Documentation for Structure software: Version 2. (Documentation for Structure.)
-Rosenberg NA. 2003. Distruct: a program for the graphical display of population structure. (Documentation for Distruct.)
-Pritchard JK, Stephens M, Donnelly P. 2000. Inference of population structure using multilocus genotype data. Genetics 155:945-959. (This is the original paper describing the method used by Structure; if you want the mathematical nitty-gritty, this is where it's at. It's relatively accessible for such a math-y paper.)
-Falush D, Stephens M, Pritchard JK. 2003. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567-1587. (This paper describes Structure version 2.0, which allows for correlations in allele frequencies between closely related populations, as well as linkage.)
-Rosenberg NA, et al. 2001. Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics 159:699-713. (This is one of the early applications of the method, using microsatellites for chicken breeds. It's a good example of using the method for complex datasets where K is large.)