Recent Changes for "Structurama" - Bodega Phylogenetics Wikihttp://bodegaphylo.wikispot.org/StructuramaRecent Changes of the page "Structurama" on Bodega Phylogenetics Wiki.en-us Structuramahttp://bodegaphylo.wikispot.org/Structurama2009-06-04 06:36:52BobThomsonComment added. <div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 115: </td> <td> Line 115: </td> </tr> <tr> <td> </td> <td> <span>+ ------<br> + ''2009-06-04 07:36:52'' [[nbsp]] Thanks for pointing this out. The website has moved to this address http://fisher.berkeley.edu/structurama/. John Huelsenbeck recently posted an update on the new version of the program [http://treethinkers.blogspot.com/2009/05/programs-gone-awol-structurama.html?showComment=1243866025770#c1409363696610850262 here] --["Users/BobThomson"]</span> </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2009-06-04 06:33:58BobThomsonurl update <div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 6: </td> <td> Line 6: </td> </tr> <tr> <td> <span>-</span> ||[http://<span>www</span>.structurama<span>.org</span>/<span>index.html</span> Structurama]|| </td> <td> <span>+</span> ||[http://<span>fisher</span>.<span>berkeley.edu/</span>structurama/ Structurama]|| </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2009-06-02 12:51:43Comment added. <div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. 
</span> </td> </tr> <tr> <td> Line 108: </td> <td> Line 108: </td> </tr> <tr> <td> </td> <td> <span>+ ------<br> + ''2009-06-02 13:51:43'' [[nbsp]] Dear Sir,<br> + The software "Structurama" is not available anymore as The link "http://www.structurama.org/index.html" is not working anymore.<br> + Best regards.<br> + JL legras legras@colmar.inra.fr<br> + <br> + --79.93.189.182</span> </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2008-03-25 10:59:06BobThomsonmove summary <div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 10: </td> <td> Line 10: </td> </tr> <tr> <td> <span>- [[TableOfContents]]<br> - </span> </td> <td> </td> </tr> <tr> <td> Line 14: </td> <td> Line 12: </td> </tr> <tr> <td> </td> <td> <span>+ <br> + [[TableOfContents]]</span> </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2008-03-19 13:45:48BobThomson(quick edit) <div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 2: </td> <td> Line 2: </td> </tr> <tr> <td> <span>-</span> ||<span>{</span>"BobThomson" Bob Thomson]|| </td> <td> <span>+</span> ||<span>[</span>"BobThomson" Bob Thomson]|| </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2008-03-19 13:45:40BobThomson(quick edit) <div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. 
</span> </td> </tr> <tr> <td> Line 2: </td> <td> Line 2: </td> </tr> <tr> <td> <span>-</span> ||Bob Thomson|| </td> <td> <span>+</span> ||<span>{"BobThomson" </span>Bob Thomson<span>]</span>|| </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2008-03-19 13:45:19BobThomson(quick edit) <div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 107: </td> <td> Line 107: </td> </tr> <tr> <td> <span>-</span> ''2008-03-19 14:44:42'' [[nbsp]] Note: I noticed that in several of the student presentations the default values for the 'shape' and 'scale' parameters were used. When datasets are uninformative about K being larger than 1, you should note that these default settings actually put almost no prior probability on K=1. Because of this, the method will infer something greater than K=1 and will not strongly favor one value o<span>f K o</span>ver another. Just <span>a</span> reminder to make sure that your data isn't overly sensitive to the prior. --["Users/BobThomson"] </td> <td> <span>+</span> ''2008-03-19 14:44:42'' [[nbsp]] Note: I noticed that in several of the student presentations the default values for the 'shape' and 'scale' parameters were used. When datasets are uninformative about K being larger than 1, you should note that these default settings actually put almost no prior probability on K=1. Because of this, the method will infer something greater than K=1 and will not strongly favor one value over another. Just <span>one more</span> reminder to make sure that your data isn't overly sensitive to the prior. --["Users/BobThomson"] </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2008-03-19 13:44:42BobThomsonComment added. 
<div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 106: </td> <td> Line 106: </td> </tr> <tr> <td> </td> <td> <span>+ ------<br> + ''2008-03-19 14:44:42'' [[nbsp]] Note: I noticed that in several of the student presentations the default values for the 'shape' and 'scale' parameters were used. When datasets are uninformative about K being larger than 1, you should note that these default settings actually put almost no prior probability on K=1. Because of this, the method will infer something greater than K=1 and will not strongly favor one value of K over another. Just a reminder to make sure that your data isn't overly sensitive to the prior. --["Users/BobThomson"]</span> </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2008-03-19 13:12:27BobThomson <div id="content" class="wikipage content"> Differences for Structurama<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 1: </td> <td> Line 1: </td> </tr> <tr> <td> </td> <td> <span>+ ||&lt;bgcolor='#E0E0FF'&gt;'''Primary Contact(s)'''||<br> + ||Bob Thomson||<br> + ||&lt;bgcolor='#E0E0FF'&gt;'''Created'''||<br> + ||19 March 2008||<br> + ||&lt;bgcolor='#E0E0FF'&gt;'''Required Software'''||<br> + ||[http://www.structurama.org/index.html Structurama]||<br> + ||&lt;bgcolor='#E0E0FF'&gt;'''Example Datafile'''||<br> + ||[[File(marm.in)]]||<br> + <br> + [[TableOfContents]]<br> + <br> + =Summary=<br> + ''This tutorial will take you through the basics of using John Huelsenbeck's ''Structurama'' software. This method is similar to the method implemented in Jonathan Pritchard's ''Structure'' software. 
''Structurama'' differs from ''Structure'' in that it allows you to place a prior on the number of populations, ''K'', and estimate the value directly. Note that ''Structurama'' implements only the basic model from ''Structure'', so if you're interested in looking at things like admixture, linkage, or correlated allele frequencies you'll want to have a look at ''Structure'' also.''<br> + <br> + =Introduction=<br> + <br> + ''Structurama'' works by grouping individuals into clusters such that Hardy-Weinberg equilibrium is maximized within clusters. It can use most commonly employed genetic markers as data, including microsatellites and SNPs. The program works by assuming a model of K populations, each of which is characterized by a set of allele frequencies at each locus. The program groups individuals into the K populations in such a way as to maximize Hardy-Weinberg equilibrium within populations. But how many populations are there? A nice thing about ''Structurama'' is that it allows you to place a prior distribution on K and let the data determine the most appropriate value, as well as give you posterior probabilities of each possible value of K. With ''Structure'', one has to run the program several times under different values of K and then determine the best value ''post hoc'', which can be very difficult in some cases.<br> + <br> + The data set we are using consists of 6 SNP loci for 35 individuals of the western pond turtle (''Emys marmorata''). These turtles are distributed along the west coast of North America, ranging from Washington to Baja California, and fall into 3 well-supported mitochondrial DNA clades: one in the north, one in the south, and one in the San Joaquin Valley. Nuclear genes that we've sequenced aren't variable enough to be informative about these clades in a phylogenetic context. Here we'll use allele frequency data with ''Structurama'' to see if these mtDNA clades correspond to populations at the nuclear level. 
The dataset here has purposely been simplified and kept small to allow for quick run times; real analyses would usually require more loci.<br> + <br> + =Tutorial=<br> + This tutorial uses an example dataset, [[File(marm.in)]]. Copy this file into the directory where you have the ''Structurama'' software stored.<br> + <br> + ==Input File Format==<br> + Start by opening a terminal window and going to the directory containing the ''Structurama'' software and your datafile...<br> + <br> + {{{cd ~/desktop/structurama}}} (or wherever you've stored your copy)<br> + <br> + {{{less -e marm.in}}} (in Windows, just open the file in WordPad)<br> + This is the basic input file format for ''Structurama''. You'll notice that it looks like a slight variant of the standard NEXUS file format. The top of the file has information about the number of individuals and the number of loci. The data for each locus are coded using 1's and 0's (for the alternative alleles) and ?'s (missing data). For example, the first individual in the dataset, {{{SJ_0001_Merce}}} (Sample 1 from Merced County in the San Joaquin mtDNA clade), is homozygous for all loci except locus 4. The sixth individual is missing data for locus 4.<br> + <br> + {{{<br> + #NEXUS!<br> + <br> + begin data;<br> + dimensions nind=35 nloci=6;<br> + info<br> + SJ_0001_Merce (1,1) (1,1) (0,0) (0,1) (1,1) (1,1),<br> + SJ_0002_Merce (1,1) (1,1) (0,0) (0,0) (1,1) (1,1),<br> + SJ_0227_Kern (1,1) (1,1) (0,0) (1,0) (1,1) (1,1),<br> + SJ_3358_Kern (1,1) (1,1) (0,0) (0,0) (1,1) (1,1),<br> + SJ_1698_Tular (1,1) (1,1) (0,0) (1,1) (1,1) (1,1),<br> + SJ_1705_Tular (1,1) (1,1) (0,0) (?,?) (1,1) (1,1),<br> + ...<br> + ;<br> + end;<br> + }}}<br> + <br> + Notice that you can insert command blocks at the bottom of the file just like in PAUP or MrBayes. 
We're going to walk through each command one at a time, so I have these commented out in the actual file.<br> + <br> + {{{<br> + begin structurama;<br> + model numpops=rv expectedpriornumpops=rv;<br> + mcmc ngen=10000 nchains=4 samplefreq=25 printfreq=1000;<br> + summarize burnin=100;<br> + end;<br> + }}}<br> + <br> + <br> + ==Starting the program and executing files==<br> + To start the program just type {{{./structurama_1.0}}} (on a Mac) or {{{structurama_1.0.exe}}} (on a PC). You should see the program's title blurb. To get help at any point, simply type {{{help}}} or {{{help (command)}}}.<br> + <br> + Execute the datafile by typing {{{execute marm.in}}}.<br> + <br> + ==Setting the model==<br> + The model is set using the {{{model}}} command; we can see the options by typing {{{help model}}}. First we need to set the number of populations, K, for the run. We can set this to some fixed value (which makes the run equivalent to a run in ''Structure''), or we can treat it as a random variable.<br> + <br> + If we set K to be a random variable, we need to input some information about what prior distribution we want to use. We can do this by setting an {{{ExpectedPriorNumPops}}}, which simply sets the mean of the prior distribution to be some value that we think is likely. In some cases we might not have prior information about the number of populations, and so we may want to set the expected number of populations to be a random variable itself. In this case a gamma distribution is used as a prior for the {{{ExpectedPriorNumPops}}}. 
If we go this route, we then need to give the program some information about the shape of this gamma distribution by using the {{{shape}}} and {{{scale}}} options.<br> + <br> + For this dataset we'll let both the {{{Numpops}}} and the {{{ExpectedPriorNumPops}}} be random variables and leave the {{{shape}}} and {{{scale}}} parameters on their defaults.<br> + <br> + {{{model Numpops=rv ExpectedPriorNumPops=rv shape=2.5 scale=0.5}}}<br> + <br> + ==Starting the run==<br> + After setting the model, we'll want to set up the MCMC run itself. Type {{{help mcmc}}}. You'll notice that these options are similar to those found in MrBayes; this is because we're using the same methods to get posterior distributions for our parameters. To keep runtimes short, we'll keep the number of generations low. We'll run 4 chains, leaving the temperature settings on the default. Notice that the outfile is automatically set to match your input file.<br> + <br> + {{{mcmc ngen=10000 nchains=4 samplefreq=25 printfreq=500}}}<br> + <br> + ==Summarizing the run==<br> + When the run finishes, the program will output a file ending in {{{.p}}} that contains information about the run. We'll want to summarize this information using the {{{summarize}}} command. The important parameter to set here is {{{burnin}}}. This sets the total number of samples to discard (NOT the number of generations). We ran our chain for 10000 generations, sampling every 25, so we have 10000/25 = 400 samples. This is a simple dataset and the chain converges quickly, so we don't need to worry too much about how much to throw out. In more complex datasets, you'd want to play with this parameter and make sure that you discard a sufficient number of samples.<br> + <br> + {{{summarize burnin=50}}}<br> + <br> + Notice that two tables are output. The first table contains information about the value of K. 
The first column of this table (labeled 'i') shows different values of K, the second column shows the posterior probability for each value of K, and the third shows the prior probability for each value. Here we notice that K=3 has the highest posterior probability, which matches nicely with our prior expectation based on the mtDNA results.<br> + <br> + The second table shows the population that each individual was assigned in the 'mean partition'. At each step in the MCMC chain, the program assigns each individual to a cluster, forming a posterior distribution for the assignment of individuals to populations (i.e., partitions). At the end of the run, we want to summarize this information (just like we summarize the posterior distribution of trees in a MrBayes run). The way ''Structurama'' does this is to find the partition that minimizes the sum of squared distances to the sampled partitions, where the distance between two partitions is simply the number of individuals that must be deleted to make them the same. The partition that does this is called the mean partition.<br> + <br> + In our data, notice that all 'San Joaquin' samples are assigned to one population, most 'Southern' samples to another, and most 'Northern' samples to a third, again fitting nicely with our expectations. There are two individuals, however, that are assigned to a population that doesn't match the mtDNA results. Specifically, {{{SO_2972_San_B}}} is assigned to the northern population, even though this turtle was collected in the south and has southern mitochondria. We have independent evidence that the population in San Bernardino is actually largely the result of recent introductions from elsewhere in the range of ''Emys marmorata''. Given this, we might expect to find turtles that get clustered into other populations. The second problematic individual ({{{NO_2938_Marip}}}) has northern mitochondria but is grouped in with the San Joaquin population. 
This individual was collected from Mariposa County, which is on the southern edge of the contact zone between the Northern and San Joaquin mitochondrial clades. Given that information, we might interpret the unexpected grouping as mitochondrial introgression.<br> + <br> + The software outputs these two tables into a file ending in {{{.sum_assignments}}}. It also outputs a file ending in {{{.sum_pairs}}}, which contains posterior probabilities of each pair of individuals belonging to the same cluster. Finally, the software outputs a file ending in {{{.sum_dist.nex}}} that shows the clustering of individuals as a distance matrix that can be read into PAUP and used to create a distance tree (this is NOT a phylogeny).<br> + <br> + =Conclusion=<br> + These are the basic steps for running ''Structurama''. The program is still quite new and will likely be extended in the future. Note that the model it implements is really quite simple; other software packages allow the incorporation of geographic data and/or phenotypic data. Some packages can infer admixture proportions or can deal with correlated allele frequencies between closely related populations. The importance of these extensions to the basic model varies depending on your particular dataset and question. Just keep in mind that there are many alternatives out there that may be more appropriate for a given study.<br> + <br> + Finally, keep in mind that results of these analyses can be sensitive to the parameter values you give the program. Ideally, with enough data, the priors will have little effect on the results, but you'll generally want to test this by running the programs under a range of settings to see how results change. 
For example, setting the expected number of populations to a value that is much higher than the true number can bias you towards inferring a higher value of K than is actually appropriate (to see this yourself, try redoing this run with {{{expectedpriornumpops}}} set to 20).<br> + <br> + =Further Reading=<br> + Huelsenbeck, J. P., E. T. Huelsenbeck, and P. Andolfatto. ''In Press''. Structurama: Bayesian inference of population structure. Bioinformatics.<br> + Huelsenbeck, J. P., and P. Andolfatto. 2007. Bayesian inference of population structure under a Dirichlet process model. Genetics 175:1787-1802.<br> + Pella, J., and M. Masuda. 2006. The Gibbs and split-merge sampler for population mixture analysis from genetic data with incomplete baselines. Can. J. Fish. Aquat. Sci. 63:576-596.<br> + Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics 155:945-959.<br> + <br> + <br> + <br> + [[Comments]]</span> </td> </tr> </table> </div> Structuramahttp://bodegaphylo.wikispot.org/Structurama2008-03-19 12:54:36BobThomsonUpload of file <a href="http://bodegaphylo.wikispot.org/Structurama?action=Files&do=view&target=marm.in">marm.in</a>.