Recent Changes for "Structure" - Bodega Phylogenetics Wiki http://bodegaphylo.wikispot.org/Structure Recent Changes of the page "Structure" on Bodega Phylogenetics Wiki. en-us Structure http://bodegaphylo.wikispot.org/Structure 2010-08-03 06:11:01 BobThomson <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 191: </td> <td> Line 191: </td> </tr> <tr> <td> <span>- ------<br> - ''2009-04-14 08:59:37'' [[nbsp]] Dear Sir,<br> - First thank you for this more explanation but could you please help me how to prepare an Input file for microsatelite data with 82 genotypes using 10 loci, especially that the 82 genotupes of almond (diploid) are divided into 5 different geographic origine because i have some troubles with Structure.<br> - Thanks in advance<br> - I'm Hassouna from Tunisia (North Africa)<br> - My mail is zallaouz@yahoo.fr --196.203.116.113<br> - ------<br> - ''2009-08-22 05:38:57'' [[nbsp]] Dear Sir, Please help me to prepare input file for SSR data<br> - SOURABH KUMAR DAS<br> - Research Fellow,<br> - Tocklai Experimental Station,<br> - Tea Research Association,<br> - Jorhat. Assam. India<br> - Sourabhkrdas@gmail.com --124.247.205.200</span> </td> <td> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2009-08-22 05:38:57 Comment added. <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. 
</span> </td> </tr> <tr> <td> Line 197: </td> <td> Line 197: </td> </tr> <tr> <td> </td> <td> <span>+ ------<br> + ''2009-08-22 05:38:57'' [[nbsp]] Dear Sir, Please help me to prepare input file for SSR data<br> + SOURABH KUMAR DAS<br> + Research Fellow,<br> + Tocklai Experimental Station,<br> + Tea Research Association,<br> + Jorhat. Assam. India<br> + Sourabhkrdas@gmail.com --124.247.205.200</span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2009-06-23 10:24:09 BobThomson (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 160: </td> <td> Line 160: </td> </tr> <tr> <td> <span>-</span> For K=3, ''Structure'' cleanly splits the Southern CA population from San Joaquin, while the Santa Barbara population appears to be drawn from each of these clusters. The northern population remains relatively unchanged. So K=3 is a better fit statistically, and it appears to capture more st<span>ur</span>cture in the data than K=2. When we move to K=4, little changes except that the Northern/SB/Monterey cluster splits into two clusters with each individual admixed from both clusters. This occurs because the Northern population consists of largely invariant individuals. Because little variation exists, little change in Hardy-Weinberg equilibrium occurs by splitting the individuals into more and more populations and so the program chooses this as the least costly place to introduce an additional cluster. Clearly this doesn't make biological sense, so in this case I'd say K=3 provides the best fit to the data. </td> <td> <span>+</span> For K=3, ''Structure'' cleanly splits the Southern CA population from San Joaquin, while the Santa Barbara population appears to be drawn from each of these clusters. The northern population remains relatively unchanged. 
So K=3 is a better fit statistically, and it appears to capture more st<span>ru</span>cture in the data than K=2. When we move to K=4, little changes except that the Northern/SB/Monterey cluster splits into two clusters with each individual admixed from both clusters. This occurs because the Northern population consists of largely invariant individuals. Because little variation exists, little change in Hardy-Weinberg equilibrium occurs by splitting the individuals into more and more populations and so the program chooses this as the least costly place to introduce an additional cluster. Clearly this doesn't make biological sense, so in this case I'd say K=3 provides the best fit to the data. </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2009-04-14 08:59:37 Comment added. <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 191: </td> <td> Line 191: </td> </tr> <tr> <td> </td> <td> <span>+ ------<br> + ''2009-04-14 08:59:37'' [[nbsp]] Dear Sir,<br> + First thank you for this more explanation but could you please help me how to prepare an Input file for microsatelite data with 82 genotypes using 10 loci, especially that the 82 genotupes of almond (diploid) are divided into 5 different geographic origine because i have some troubles with Structure.<br> + Thanks in advance<br> + I'm Hassouna from Tunisia (North Africa)<br> + My mail is zallaouz@yahoo.fr --196.203.116.113</span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2009-03-13 13:54:28 glor <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. 
</span> </td> </tr> <tr> <td> Line 164: </td> <td> Line 164: </td> </tr> <tr> <td> <span>-</span> ==Conclusion== </td> <td> <span>+</span> ==Conclusion<span>s</span>== </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2009-03-13 13:54:00 glor <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 16: </td> <td> Line 16: </td> </tr> <tr> <td> <span>-</span> =Structure=<br> - <span><br> - <br> - </span>==Getting Started== </td> <td> <span>+</span> <span>=</span>=Structure=<span>=</span><br> <span>+ </span>-<span>----<br> +</span> <span>=</span>==Getting Started==<span>=</span> </td> </tr> <tr> <td> Line 60: </td> <td> Line 59: </td> </tr> <tr> <td> <span>-</span> ==Extra Parameter File== </td> <td> <span>+</span> <span>=</span>==Extra Parameter File==<span>=</span> </td> </tr> <tr> <td> Line 67: </td> <td> Line 66: </td> </tr> <tr> <td> <span>-</span> ==Running Structure== </td> <td> <span>+</span> <span>=</span>==Running Structure==<span>=</span> </td> </tr> <tr> <td> Line 78: </td> <td> Line 77: </td> </tr> <tr> <td> <span>-</span> =Distruct= </td> <td> <span>+</span> <span>=</span>=Distruct=<span>=<br> + -----</span> </td> </tr> <tr> <td> Line 83: </td> <td> Line 83: </td> </tr> <tr> <td> <span>-</span> ==File Format== </td> <td> <span>+</span> <span>=</span>==File Format==<span>=</span> </td> </tr> <tr> <td> Line 94: </td> <td> Line 94: </td> </tr> <tr> <td> <span>- </span> </td> <td> </td> </tr> <tr> <td> Line 101: </td> <td> Line 100: </td> </tr> <tr> <td> <span>-</span> ==Distruct Parameter File== </td> <td> <span>+</span> <span>=</span>==Distruct Parameter File==<span>=</span> </td> </tr> <tr> <td> Line 137: </td> <td> Line 136: </td> </tr> <tr> <td> <span>-</span> =Inferring K= </td> <td> <span>+</span> <span>=</span>=Inferring K=<span>=</span> </td> 
</tr> <tr> <td> Line 165: </td> <td> Line 164: </td> </tr> <tr> <td> <span>-</span> =Conclusion= </td> <td> <span>+</span> <span>=</span>=Conclusion=<span>=<br> + -----</span> </td> </tr> <tr> <td> Line 168: </td> <td> Line 168: </td> </tr> <tr> <td> <span>-</span> =Downloading and Installing the Structure 2.2. Graphical Front End= </td> <td> <span>+</span> <span>=</span>=Downloading and Installing the Structure 2.2. Graphical Front End=<span>=</span> </td> </tr> <tr> <td> Line 176: </td> <td> Line 176: </td> </tr> <tr> <td> <span>-</span> =Further Reading= </td> <td> <span>+</span> <span>=</span>=Further Reading=<span>=<br> + -----</span> </td> </tr> <tr> <td> Line 178: </td> <td> Line 179: </td> </tr> <tr> <td> </td> <td> <span>+ </span> </td> </tr> <tr> <td> Line 179: </td> <td> Line 181: </td> </tr> <tr> <td> </td> <td> <span>+ </span> </td> </tr> <tr> <td> Line 180: </td> <td> Line 183: </td> </tr> <tr> <td> </td> <td> <span>+ </span> </td> </tr> <tr> <td> Line 181: </td> <td> Line 185: </td> </tr> <tr> <td> </td> <td> <span>+ </span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2009-03-13 13:51:32 glor <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. 
</span> </td> </tr> <tr> <td> Line 1: </td> <td> Line 1: </td> </tr> <tr> <td> <span>-</span> ||&lt;bgcolor='#E0E0FF'&gt;'''Primary Contact(s)'''||<br> <span>-</span> ||["Users/BobThomson" Bob Thomson]||<span><br> - ||&lt;bgcolor='#E0E0FF'&gt;'''Created'''||<br> - ||</span>1 March 2007<span>||<br> - ||&lt;bgcolor='#E0E0FF'&gt;'''Required Software'''||<br> - </span>||[http://pritch.bsd.uchicago.edu/structure.html Structure] (command line version)<span>||<br> - ||</span>[http://rosenberglab.bioinformatics.med.umich.edu/distruct.html Distruct]<span>||<br> - ||&lt;bgcolor='#E0E0FF'&gt;'''Example Datafile'''||<br> - </span>||[[File(marm_struct.input)]]|| </td> <td> <span>+</span> ||&lt;bgcolor='#E0E0FF'&gt;'''Primary Contact(s)'''||<span>&lt;bgcolor='#E0E0FF'&gt;'''Created'''||&lt;bgcolor='#E0E0FF'&gt;'''Required Software'''||&lt;bgcolor='#E0E0FF'&gt;'''Example Datafile'''||</span><br> <span>+</span> ||["Users/BobThomson" Bob Thomson]||1 March 2007||[http://pritch.bsd.uchicago.edu/structure.html Structure] (command line version)<span>, </span>[http://rosenberglab.bioinformatics.med.umich.edu/distruct.html Distruct]||[[File(marm_struct.input)]]|| </td> </tr> <tr> <td> Line 11: </td> <td> Line 4: </td> </tr> <tr> <td> <span>-</span> <span><br> - </span>=Summary= </td> <td> <span>+</span> <span>=</span>=Summary=<span>=<br> + -----</span> </td> </tr> <tr> <td> Line 15: </td> <td> Line 8: </td> </tr> <tr> <td> <span>- [[TableOfContents]]<br> - <br> - <br> - =Introduction=</span> </td> <td> <span>+ ==Introduction==<br> + -----</span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2009-03-04 17:34:03 BobThomson <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. 
</span> </td> </tr> <tr> <td> Line 178: </td> <td> Line 178: </td> </tr> <tr> <td> <span>- After</span> g<span>oing to all the trouble to produce a graphical front-end</span>, <span>Pritchard et al. stumble a few inches from the finish line when they pro</span>v<span>ide o</span>ve<span>rly brief installation instructions</span>. Users who are just becoming familiar with the basics of file architecture may benefit from the supplementary instructions below, which are particularly intended to help those running MacOSX on an Intel processor. </td> <td> <span>+ Pritchard et al. have developed a</span> g<span>raphical front end for ''Structure''</span>, <span>howe</span>v<span>er the installation instructions are brief and ha</span>ve<span>&nbsp;caused some students trouble</span>. Users who are just becoming familiar with the basics of file architecture may benefit from the supplementary instructions below, which are particularly intended to help those running MacOSX on an Intel processor. </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-07-06 16:48:24 glor (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 183: </td> <td> Line 183: </td> </tr> <tr> <td> <span>- 4. Assuming everything worked, you should be able to launch the graphical front-end by typing "./structure" in the Terminal window from within the folder you uncompressed earlier.</span> </td> <td> <span>+ 4. Assuming everything worked, you should be able to launch the graphical front-end by typing "./structure" in the Terminal window. 
Of course, your working directory (i.e., the place you are in the Terminal) must correspond with the location of the structure executable, which should be the folder that you uncompressed earlier.</span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-07-06 16:46:01 glor <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 19: </td> <td> Line 19: </td> </tr> <tr> <td> <span>- ''Structure'' is a versatile tool that can be used in several ways. Here we will use it to investigate the population structure present in a group of California turtles.</span> </td> <td> <span>+ ''Structure'' is a versatile tool that can be used in several ways. This tutorial shows how the command line version of "Structure" can be used with the sister program "Distruct" to investigate population structure in a group of California turtles. Although we advise anyone interested in "Structure" to learn the command line version of this software, we provide supplementary notes about installation and use of the "Structure" graphical front end as a foot-note.</span> </td> </tr> <tr> <td> Line 177: </td> <td> Line 177: </td> </tr> <tr> <td> </td> <td> <span>+ =Downloading and Installing the Structure 2.2. Graphical Front End=<br> + After going to all the trouble to produce a graphical front-end, Pritchard et al. stumble a few inches from the finish line when they provide overly brief installation instructions. Users who are just becoming familiar with the basics of file architecture may benefit from the supplementary instructions below, which are particularly intended to help those running MacOSX on an Intel processor.<br> + <br> + 1. 
Download the [http://pritch.bsd.uchicago.edu/software/structure2_2.html compressed installation file] to your desktop and decompress this file by double-clicking it (if you're OS hasn't done this automatically). Place this folder in MacOSXs Applications folder.<br> + 2. Open the MacOSX Terminal application and navigate to the folder you just unzipped. If you haven't changed the default terminal preferences (which should start you off in the home directory of the default user) and have absolutely no experience navigating around folders in the Terminal you should be able to do this by typing "cd ../../Applications/structure2.2.3_install" at the prompt.<br> + 3. To install the package, all you need to do is to use the Terminal to run the install script included in the unzipped folder. You may find that simply typing "./install" results in the following error "mkdir: /usr/local/Structure/: Permission denied." If this is the case, you can either change permissions on this folder, or simply install the required information elsewhere by typing the name of the desired location after "./install".<br> + 4. Assuming everything worked, you should be able to launch the graphical front-end by typing "./structure" in the Terminal window from within the folder you uncompressed earlier.<br> + </span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:11:47 BobThomson (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. 
</span> </td> </tr> <tr> <td> Line 178: </td> <td> Line 178: </td> </tr> <tr> <td> <span>- These are the major background papers on this method, I've included them in the structure folder.</span> </td> <td> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:11:01 BobThomson fixing figure layout <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 165: </td> <td> Line 165: </td> </tr> <tr> <td> <span>- [[Image(K_3.png, thumbnail, right, 300, noborder, "Figure 2 - Example output from distruct with K set to 3.")]]</span> </td> <td> <span>+ [[anchor(Figure3)]]</span> </td> </tr> <tr> <td> Line 167: </td> <td> Line 167: </td> </tr> <tr> <td> - <span>[[an</span>c<span>hor(Figure3</span>)]]<span><br> - </span>[[Image(K_4.png, thumbnail, right, 3<span>0</span>0, noborder, "Figure 3 - Example output from distruct with K set to 4.")]] </td> <td> <span>+ ||[[Image(K_3.png, thumbnail, right, 350, noborder, "Figure 2 </span>- <span>Example output from distru</span>c<span>t with K set to 3."</span>)]]<span>||</span>[[Image(K_4.png, thumbnail, right, 3<span>5</span>0, noborder, "Figure 3 - Example output from distruct with K set to 4.")]]<span>||</span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:07:30 BobThomson (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 79: </td> <td> Line 79: </td> </tr> <tr> <td> <span>-</span> Before going any further, check the parameter estimates printed by the program throughout the run. 
You want to make sure that the estimates of alpha, D, and ln Likelihood have converged before the end of the burnin period is reached. It is also advisable to repeat runs several times to ensure that you get consistent answers. The run lengths I've recommended appear to be adequate for this dataset,<span><br> -</span> but you would need to worry about this more with a real dataset. </td> <td> <span>+</span> Before going any further, check the parameter estimates printed by the program throughout the run. You want to make sure that the estimates of alpha, D, and ln Likelihood have converged before the end of the burnin period is reached. It is also advisable to repeat runs several times to ensure that you get consistent answers. The run lengths I've recommended appear to be adequate for this dataset, but you would need to worry about this more with a real dataset. </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:06:53 BobThomson (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 61: </td> <td> Line 61: </td> </tr> <tr> <td> <span>-</span> The next 14 parameters refer to the format of the data file, and tell the program what, precisely, we have in there. NUMINDS and NUMLOCI refer to the numbers of individuals and loci in your input data file. For our case,these should be set to 43 and 6 respectively. LABEL wants to know if we have labeled our individuals in the first column of the input file. We have, so it should be set to 1. Likewise, we've labeled individuals according to mtDNA clades so POPDATA should be set to 1. Note that we have the option of using this population data when inferring K and clustering individuals during the structure run. 
Here we are interested in letting the genetic data tell us what K is, so we won't use the population information.<span><br> -</span> If you did want to do this, you would set the USEPOPINFO parameter in the {{{extraparams}}} file to 1, or you could include the statement USEPOPINFO==1 in your inputfile, this is what the POPFLAG parameter refers to, we didn't use this flag so set it to 0. </td> <td> <span>+</span> The next 14 parameters refer to the format of the data file, and tell the program what, precisely, we have in there. NUMINDS and NUMLOCI refer to the numbers of individuals and loci in your input data file. For our case,these should be set to 43 and 6 respectively. LABEL wants to know if we have labeled our individuals in the first column of the input file. We have, so it should be set to 1. Likewise, we've labeled individuals according to mtDNA clades so POPDATA should be set to 1. Note that we have the option of using this population data when inferring K and clustering individuals during the structure run. Here we are interested in letting the genetic data tell us what K is, so we won't use the population information. If you did want to do this, you would set the USEPOPINFO parameter in the {{{extraparams}}} file to 1, or you could include the statement USEPOPINFO==1 in your inputfile, this is what the POPFLAG parameter refers to, we didn't use this flag so set it to 0. </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:06:33 BobThomson (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 55: </td> <td> Line 55: </td> </tr> <tr> <td> <span>-</span> This ought to look familiar to those of you with experience programming in C, the define statement tells Structure that the value for MISSING data is -9. 
If you code your missing data some other way, simply change the number to match your coding scheme. Notice that each parameter takes a certain type of data, the options are integer (int), Boolean(B), or String(str), which means that you can't code your missing data with <span>`</span>?', you can only use integers. </td> <td> <span>+</span> This ought to look familiar to those of you with experience programming in C, the define statement tells Structure that the value for MISSING data is -9. If you code your missing data some other way, simply change the number to match your coding scheme. Notice that each parameter takes a certain type of data, the options are integer (int), Boolean(B), or String(str), which means that you can't code your missing data with <span>'</span>?', you can only use integers. </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:05:58 BobThomson <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 40: </td> <td> Line 40: </td> </tr> <tr> <td> <span>-</span> This is the basic ''Structure'' input file format. Eight columns correspond respectively to the name of<span><br> - </span> the individual, which mtDNA clade each individual belongs to, and six SNP loci. Each SNP<span><br> - </span> locus is coded using 1's and 0's (for the alternative alleles), and -9's (for missing data). Each diploid individual takes up two rows in the data file, which allows you to code homozygous and heterozygous positions, e.g.: </td> <td> <span>+</span> This is the basic ''Structure'' input file format. Eight columns correspond respectively to the name of the individual, which mtDNA clade each individual belongs to, and six SNP loci. Each SNP locus is coded using 1's and 0's (for the alternative alleles), and -9's (for missing data). 
Each diploid individual takes up two rows in the data file, which allows you to code homozygous and heterozygous positions, e.g.: </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:05:34 BobThomson (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 40: </td> <td> Line 40: </td> </tr> <tr> <td> <span>-</span> This is the basic <span>s</span>tructure input file format. Eight columns correspond respectively to the name of </td> <td> <span>+</span> This is the basic <span>''S</span>tructure<span>''</span> input file format. Eight columns correspond respectively to the name of </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:04:49 BobThomson <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 23: </td> <td> Line 23: </td> </tr> <tr> <td> <span>-</span> The data set we are using consists of 6 SNP loci (4 nuclear, 2 mitochondrial) for 43 individuals of the western pond turtle (''Emys marmorata''). These turtles fall out into 5 well-supported mitochondrial DNA clades, which are not well supported on a gene-by-gene basis at the nuclear level. You'd want more loci in a real analysis, but I've kept the dataset small here to allow for reasonable run times. </td> <td> <span>+</span> The data set we are using consists of 6 SNP loci (4 nuclear, 2 mitochondrial) for 43 individuals of the western pond turtle (''Emys marmorata''). These turtles fall out into 5 well-supported mitochondrial DNA clades, which are not well supported on a gene-by-gene basis at the nuclear level. 
You'd want more loci in a real analysis, but I've kept the dataset small here to allow for reasonable run times.<span>&nbsp;The instructions ought to work for both OS X and Windows, though the Unix commands may not work for Windows users. I've tried to include instructions for both operating systems where it's critical.</span> </td> </tr> <tr> <td> Line 26: </td> <td> Line 26: </td> </tr> <tr> <td> <span>- This tutorial uses the example dataset called {{{marm_struct.input}}} that you can download above. The instructions ought to work for both OS X and Windows, though the Unix commands may not work for Windows users. I've tried to include instructions for both operating systems where it's critical.</span> </td> <td> <span>+ </span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 12:04:00 BobThomson (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 23: </td> <td> Line 23: </td> </tr> <tr> <td> <span>-</span> The data set we are using consists of 6 SNP loci (4 nuclear, 2 mitochondrial) for 43 individuals of the western pond turtle (''Emys marmorata''). These turtles fall out into 5 well-supported mitochondrial DNA clades<span>&nbsp;(see last page of handout)</span>, which are not well supported on a gene-by-gene basis at the nuclear level. You'd want more loci in a real analysis, but I've kept the dataset small here to allow for reasonable run times. </td> <td> <span>+</span> The data set we are using consists of 6 SNP loci (4 nuclear, 2 mitochondrial) for 43 individuals of the western pond turtle (''Emys marmorata''). These turtles fall out into 5 well-supported mitochondrial DNA clades, which are not well supported on a gene-by-gene basis at the nuclear level. 
You'd want more loci in a real analysis, but I've kept the dataset small here to allow for reasonable run times. </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 11:57:58 BobThomson move summary above table of contents <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 12: </td> <td> Line 12: </td> </tr> <tr> <td> </td> <td> <span>+ =Summary=<br> + ''This tutorial outlines the basics of running Jonathan Pritchard's ''Structure'' and Noah Rosenberg's ''Distruct'' software packages. It uses a small example dataset containing SNP data for Western Pond Turtles (''Emys marmorata'').''<br> + </span> </td> </tr> <tr> <td> Line 14: </td> <td> Line 17: </td> </tr> <tr> <td> <span>- =Summary=<br> - ''This tutorial outlines the basics of running Jonathan Pritchard's ''Structure'' and Noah Rosenberg's ''Distruct'' software packages. It uses a small example dataset containing SNP data for Western Pond Turtles (''Emys marmorata'').''</span> </td> <td> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 11:57:07 BobThomson (quick edit) <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. 
</span> </td> </tr> <tr> <td> Line 15: </td> <td> Line 15: </td> </tr> <tr> <td> <span>-</span> ''This tutorial outlines the basics of running Jonathan Pritchard's ''Structure'' and Noah Rosenberg's ''Distruct'' software packages.<span>''</span> </td> <td> <span>+</span> ''This tutorial outlines the basics of running Jonathan Pritchard's ''Structure'' and Noah Rosenberg's ''Distruct'' software packages.<span>&nbsp;It uses a small example dataset containing SNP data for Western Pond Turtles (''Emys marmorata'').''</span> </td> </tr> </table> </div> Structure http://bodegaphylo.wikispot.org/Structure 2008-03-25 11:55:55 BobThomson wikifying 2007 tutorial on structure and distruct <div id="content" class="wikipage content"> Differences for Structure<p><strong></strong></p><table> <tr> <td> <span> Deletions are marked with - . </span> </td> <td> <span> Additions are marked with +. </span> </td> </tr> <tr> <td> Line 1: </td> <td> Line 1: </td> </tr> <tr> <td> </td> <td> <span>+ ||&lt;bgcolor='#E0E0FF'&gt;'''Primary Contact(s)'''||<br> + ||["Users/BobThomson" Bob Thomson]||<br> + ||&lt;bgcolor='#E0E0FF'&gt;'''Created'''||<br> + ||1 March 2007||<br> + ||&lt;bgcolor='#E0E0FF'&gt;'''Required Software'''||<br> + ||[http://pritch.bsd.uchicago.edu/structure.html Structure] (command line version)||<br> + ||[http://rosenberglab.bioinformatics.med.umich.edu/distruct.html Distruct]||<br> + ||&lt;bgcolor='#E0E0FF'&gt;'''Example Datafile'''||<br> + ||[[File(marm_struct.input)]]||<br> + <br> + <br> + [[TableOfContents]]<br> + <br> + =Summary=<br> + ''This tutorial outlines the basics of running Jonathan Pritchard's ''Structure'' and Noah Rosenberg's ''Distruct'' software packages.''<br> + <br> + =Introduction=<br> + ''Structure'' is a versatile tool that can be used in several ways. 
Here we will use it to investigate the population structure present in a group of California turtles.<br> + <br> + ''Structure'' works by grouping individuals into clusters such that Hardy-Weinberg equilibrium is maximized within clusters. It can use most commonly employed genetic markers as data, including AFLPS, SNPs, microsatellites, sequence haplotypes, Hapstrs, SNPstrs, etc. The program works by assuming a model of K populations, each of which are characterized by a set of allele frequencies at each locus. The program groups individuals into the K populations in such a way as to maximize HW equilibrium within populations. By varying the K parameter across several runs of the program and inspecting the resulting probabilities under different values, we can infer a likely value of K which best captures the variation present in the data. This program is often used in conjunction with a second program, ''Distruct'', which makes it easy to visualize the results. A final note before proceeding is that a graphical version of the program exists for windows and various flavors of Unix, we're going to use the command line version here because its more general, easier to customize, and easy to batch.<br> + <br> + The data set we are using consists of 6 SNP loci (4 nuclear, 2 mitochondrial) for 43 individuals of the western pond turtle (''Emys marmorata''). These turtles fall out into 5 well-supported mitochondrial DNA clades (see last page of handout), which are not well supported on a gene-by-gene basis at the nuclear level. You'd want more loci in a real analysis, but I've kept the dataset small here to allow for reasonable run times.<br> + <br> + =Structure=<br> + This tutorial uses the example dataset called {{{marm_struct.input}}} that you can download above. The instructions ought to work for both OS X and Windows, though the Unix commands may not work for Windows users. 
I've tried to include instructions for both operating systems where it's critical.<br> + <br> + ==Getting Started==<br> + Start by going into the folder containing the datafiles for structure...<br> + <br> + {{{cd ~/desktop/bodega_structure/datafiles/structure}}} (or wherever you've stored your copy)<br> + <br> + and list the contents...<br> + {{{ls}}} (or {{{dir}}} for you non-Unix folks)<br> + <br> + You'll see three files: two parameter files for structure ({{{mainparams}}} and {{{extraparams}}}) and a datafile ({{{marm_struct.input}}}). The settings for each run are controlled by these two parameter files, which structure always looks for upon execution. It will also look for an input data file, the name of which you specify in the {{{mainparams}}} file.<br> + <br> + ==File Format==<br> + {{{open -e marm_struct.input}}} (or just open in WordPad in Windows)<br> + This is the basic structure input file format. Eight columns correspond respectively to the name of<br> + the individual, which mtDNA clade each individual belongs to, and six SNP loci. Each SNP<br> + locus is coded using 1's and 0's (for the alternative alleles), and -9's (for missing data). Each diploid individual takes up two rows in the data file, which allows you to code homozygous and heterozygous positions, e.g.:<br> + <br> + {{{<br> + 3259_Mader 1 1 1 0 0 -9 0<br> + 3259_Mader 1 1 1 0 1 -9 0}}}<br> + <br> + So individual 3259_Mader is in mtDNA clade 1; is homozygous for SNP loci 1, 2, 3, and 6; is heterozygous for locus 4; and is missing data for locus 5. Easy!<br> + <br> + Now on to setting up ''Structure'' 's two parameter files.<br> + <br> + ==Main Parameter File==<br> + This file contains the main parameters that you'll need to set in order to run the program.
Each parameter looks something like:<br> + <br> + {{{#define MISSING -9 // (int) value given to missing genotype data}}}<br> + <br> + This ought to look familiar to those of you with experience programming in C: the define statement tells Structure that the value for MISSING data is -9. If you code your missing data some other way, simply change the number to match your coding scheme. Notice that each parameter takes a certain type of data; the options are integer (int), Boolean (B), or string (str). Because MISSING takes an integer, you can't code your missing data with `?'; you can only use integers.<br> + <br> + On to the parameters themselves...<br> + <br> + Most are self-explanatory. INFILE and OUTFILE correspond to your input datafile name and the name of the file you want to output to, respectively. Note that the program will overwrite the output file if it already exists; this is important to remember when you are doing multiple runs at different values of K.<br> + <br> + The next 14 parameters refer to the format of the data file, and tell the program what, precisely, we have in there. NUMINDS and NUMLOCI refer to the numbers of individuals and loci in your input data file. For our case, these should be set to 43 and 6 respectively. LABEL wants to know if we have labeled our individuals in the first column of the input file. We have, so it should be set to 1. Likewise, we've labeled individuals according to mtDNA clades, so POPDATA should be set to 1. Note that we have the option of using this population data when inferring K and clustering individuals during the structure run.
Here we are interested in letting the genetic data tell us what K is, so we won't use the population information.<br> + If you did want to do this, you would set the USEPOPINFO parameter in the {{{extraparams}}} file to 1. The POPFLAG parameter says whether your input file contains a column flagging which individuals should use their population information when USEPOPINFO is 1; our file has no such column, so set it to 0.<br> + <br> + PHENOTYPE, EXTRACOLS, PHASEINFO, and MARKOVPHASE all refer to additional data that you can include in your data file for various reasons, none of which we've done here, so they should all be 0.<br> + <br> + MISSING and PLOIDY refer to exactly what they sound like. Our missing data is coded as -9 and turtles are diploid. The last three formatting parameters refer to more file format options and data that can be included; set these to 0 for this exercise.<br> + <br> + Three final parameters in this file refer to the settings for the actual program run. You need to tell structure which value of K to assume using the MAXPOPS parameter. We want to infer the best value of K from the data, so we'll run the program at several values and go back and compare the results (more on this below). Our mtDNA phylogeny suggests 5 clades, so we'll run the program with MAXPOPS set at 1 through 5. Start with MAXPOPS equal to 2. Finally, we need to set the length of the burnin (BURNIN) and the number of MCMC reps to do after the burnin (NUMREPS). We're purposely using a simple dataset here so that we can run the program fast and complete this exercise in a reasonable amount of time. For us, these values can be set to 1000 and 100000 respectively. For more complex (i.e. realistic) datasets you'd want each of these to be an order of magnitude or so higher in order to make sure that the run converges. After the run finishes we'll check to make sure that we've run the Markov chain long enough.
Save and close this file.<br> + <br> + ==Extra Parameter File==<br> + This file is set up similarly to the {{{mainparams}}} file; it contains parameters concerning what model of population genetic structure ''Structure'' uses, output options, priors and a few other parameters. We won't go through them all, as you'll rarely need to change most of them. If you're still curious, the program's documentation goes through them in detail.<br> + <br> + One parameter to be concerned with here is FREQSCORR. Setting FREQSCORR to 1 causes structure to use a model in which the allele frequencies among closely related populations can be correlated. This can be useful in trying to differentiate between closely related populations, and other 'hard' structure problems. It is prone to overestimating K, however, so it should be used with care, particularly if you are interested in species boundaries.<br> + <br> + Enough parameters, let's run the program already...<br> + <br> + ==Running Structure==<br> + Structure automatically looks for the {{{mainparams}}} and {{{extraparams}}} files, as well as whatever input file you tell it (in the {{{mainparams}}} file) when you run the program. So just make sure these three files are in the same folder as the structure executable. If you're using OS X, execute the program by typing {{{./structureOSX}}}; on a PC just type {{{structure.exe}}}. You'll see a lot of output on the screen, and when the run finishes it will print the results to the file you've specified.<br> + <br> + Before going any further, check the parameter estimates printed by the program throughout the run. You want to make sure that the estimates of alpha, D, and ln Likelihood have converged before the end of the burnin period is reached. It is also advisable to repeat runs several times to ensure that you get consistent answers.
The run lengths I've recommended appear to be adequate for this dataset,<br> + but you would need to worry about this more with a real dataset.<br> + <br> + If the parameter estimates look like they've converged, go ahead and open your output file.<br> + <br> + {{{open -e marm_struct.output_f}}}<br> + <br> + This file has some information about the settings for the run, a table that shows the proportion of each predefined population assigned to each cluster, a table of likelihoods, a table of the proportion of each individual's alleles assigned to each cluster, as well as some other information. Compare your estimate for the ln probability of the data with those of the people around you before moving on; it may not be exactly the same for everyone but it ought to be close.<br> + <br> + =Distruct=<br> + Now that we've got output from ''Structure'', let's make one of those pretty figures that everyone's putting in their papers these days. This is done by giving some of the output from ''Structure'' to the separate program ''Distruct'', which uses this data to produce a figure showing the proportion of each individual's ancestry inferred to be from each cluster, as well as how this relates to your predefined populations. You can glean the same information by staring at the tables in the ''Structure'' output file, but that's no fun and this way is much easier.<br> + <br> + ''Distruct'' operates much like ''Structure'' in that it looks for a certain set of files when it runs. It has a single parameter file and needs two data files; it will also optionally take a few more files that tell it how to label populations, which colors to use, etc.<br> + <br> + ==File Format==<br> + The two datafiles are simply the individual and population tables from the ''Structure'' output file. Simply copy and paste these two tables into their own plain text files. You only want the data itself, NOT the labels and formatting that structure puts around it.
Your population file should contain text that looks like the following:<br> + <br> + {{{1: 0.109 0.891 23<br> + 2: 0.075 0.925 2<br> + 3: 0.987 0.013 5<br> + 4: 0.969 0.031 10<br> + 5: 0.883 0.117 3}}}<br> + <br> + and your individual file should look like this (but longer; this is only the top of the file):<br> + <br> + <br> + {{{1 0001_Merce (0) 4 : 0.970 0.030 (0.800,1.000) (0.000,0.200)<br> + 2 0002_Merce (0) 4 : 0.955 0.045 (0.694,1.000) (0.000,0.306)<br> + 3 0227_Kern (0) 4 : 0.970 0.030 (0.802,1.000) (0.000,0.198)<br> + 4 0331_San_D (0) 3 : 0.990 0.010 (0.941,1.000) (0.000,0.059)<br> + 5 0669_XX (0) 1 : 0.304 0.696 (0.000,0.814) (0.186,1.000) }}}<br> + <br> + ==Distruct Parameter File==<br> + Once you have each data file made, open ''Distruct'' 's parameter file, which is called {{{drawparams}}}. Input the names of the two datafiles into the first two parameters. The next two parameters want the names of files which contain labels for your predefined populations. Our predefined populations are labeled 1 through 5, but we can have distruct translate these numbers to geographic locations. So just make a file containing the text:<br> + <br> + {{{<br> + 1 North<br> + 2 San Bernardino<br> + 3 South<br> + 4 San Joaquin<br> + 5 Santa Barbara<br> + }}}<br> + <br> + save it, and input the filename into the INFILE_LABEL_BELOW parameter. We won't use any labels on top, so skip the INFILE_LABEL_ATOP parameter. The INFILE_CLUST_PERM parameter takes a file which specifies the order in which to print the clusters as well as which colors to use. This file looks like<br> + <br> + {{{<br> + 1 red<br> + 2 green<br> + 3 yellow<br> + 4 blue<br> + 5 orange<br> + }}}<br> + <br> + The order of numbers controls the order in which the clusters are printed, and the color labels tell the program which color to use.
Finally, give the program the name of a file to output to.<br> + <br> + The next section of the {{{drawparams}}} file wants the value of K used; set this to match whichever value of MAXPOPS you set back in the structure run. NUMPOPS is 5 (the number of mtDNA clades), and NUMINDS is 43. The remaining parameters are all printing options and are self-explanatory. Most can be left at the defaults, but make sure that you set PRINT_LABEL_ATOP to 0 and PRINT_LABEL_BELOW to 1, because this is how we've set up our label files.<br> + <br> + Now you should be all set to actually run the program.<br> + <br> + ==Running Distruct==<br> + Again, make sure all the necessary files are in the same folder and type {{{./distruct_OSX}}} or {{{distruct.exe}}} as the case may be...<br> + You should now have a PostScript file named whatever you put into the {{{drawparams}}} file ([#Figure1]). If you're on a Mac just open it with Preview, and it should automatically convert to a PDF and open. On a PC, you can use whatever PostScript program you already have, or download Ghostscript and Ghostview, which are free tools for viewing PostScript files.<br> + <br> + [[anchor(Figure1)]]<br> + [[Image(distruct_output.png, thumbnail, right, 400, noborder, "Figure 1 - Example output from distruct with K set to 2.")]]<br> + <br> + Each vertical bar represents a single individual. The individuals are grouped into our 5 predefined populations. The colors represent the proportion of each individual's loci that are drawn from each of K = 2 clusters. For example, the leftmost individual in the above figure is from the northern mtDNA clade, and about a third of its alleles are drawn from the other cluster. With K=2, ''Structure'' lumps the northern mtDNA clade and the San Bernardino and Monterey clade into one cluster and all the other clades into another.
The San Bernardino/Monterey clade seems odd; however, this clade is only represented by 2 individuals and there is some evidence that the San Bernardino population is non-natural. It's possible that the San Bernardino individual is a relatively recent introduction from the Monterey area. This aside, the groupings make good biological sense.<br> + <br> + =Inferring K=<br> + In order to infer the `best' value of K, we now need to do a run at each value of K that we are considering. Go back to your {{{mainparams}}} file, change the value of K and the name of the output file (so you don't overwrite your results), and repeat the run for K=1,...,5.<br> + <br> + There is no set way to choose the value of K which best fits the data, but there are several informal guides that can be used. You should always be careful in considering the biological reality of the `best' value of K, because ''Structure'' can be prone to overestimating K given even mild departures from the underlying model. However, when you're careful it appears to work well in practice. The goal should be to choose the lowest value of K which effectively captures the major structure present in the data. One way to do this is to compare the probabilities of the data given K for each value of K that you tried. It is often the case that the probability will be low at values of K less than the appropriate value, then plateau at and above the appropriate value. For example, in one set of my runs I got<br> + <br> + {{{<br> + K ln P(X|K)<br> + 1 -276.2<br> + 2 -186.7<br> + 3 -155.6<br> + 4 -155<br> + 5 -158.3<br> + }}}<br> + <br> + As you can see, ln P(X|K) plateaus around -155 at K=3. Alternatively, you can make this comparison by computing posterior probabilities using a uniform prior on K. I won't go through this here; however, the calculation is straightforward and is laid out in detail in the ''Structure'' documentation.
In practice, one tends to see the same pattern as for ln P(X|K), with very low P(K) at values less than the appropriate value, and then a plateau.<br> + <br> + So inspection of the probabilities suggests that K is around 3 or 4. Now let's look at the ''Distruct'' outputs and see if this is biologically realistic. We've already seen the breakdown for [#Figure1 K=2]; [#Figure2 K=3] and [#Figure3 K=4] are printed below.<br> + <br> + [[anchor(Figure2)]]<br> + [[Image(K_3.png, thumbnail, right, 300, noborder, "Figure 2 - Example output from distruct with K set to 3.")]]<br> + <br> + [[anchor(Figure3)]]<br> + [[Image(K_4.png, thumbnail, right, 300, noborder, "Figure 3 - Example output from distruct with K set to 4.")]]<br> + <br> + <br> + For K=3, ''Structure'' cleanly splits the Southern CA population from San Joaquin, while the Santa Barbara population appears to be drawn from each of these clusters. The northern population remains relatively unchanged. So K=3 is a better fit statistically, and it appears to capture more structure in the data than K=2. When we move to K=4, little changes except that the Northern/San Bernardino/Monterey cluster splits into two clusters, with each individual admixed from both clusters. This occurs because the Northern population consists of largely invariant individuals. Because little variation exists, little change in Hardy-Weinberg equilibrium occurs by splitting the individuals into more and more populations, and so the program chooses this as the least costly place to introduce an additional cluster. Clearly this doesn't make biological sense, so in this case I'd say K=3 provides the best fit to the data.<br> + <br> + With more complex datasets, you may find that it's difficult to decide between two similar values of K. In these cases it can be useful to pull out the ambiguous populations and analyze them separately. This reduces the problem to a K=1 vs K=2 problem, which is generally easier to decide between than say K=15 vs.
K=16.<br> + <br> + =Conclusion=<br> + These are the basic steps involved in using ''Structure'' and ''Distruct''. It is also possible to use this tool to assign individuals to populations as well as to do association tests (by using locality and phenotype data in the clustering process, respectively). Keys to remember are to always be cautious in the biological interpretation of K and to carefully consider how you set up the parameters of each run. Without careful thought, odd conclusions are likely. Even in the simple dataset we've used here, straightforward computation of the posterior on K would lead you to conclude that K=4 is more likely than K=3, a result that makes little biological sense. In larger, more complex datasets, this problem can become more difficult to overcome. Additionally, with large datasets it becomes increasingly critical to check that your burnin is long enough and to perform multiple runs to make sure that the chain has converged. As you might guess, the run time for the program increases with the size of the data set, and with the size of K, so complex data sets can require seriously long run times.<br> + <br> + =Further Reading=<br> + These are the major background papers on this method; I've included them in the structure folder.<br> + -Pritchard JK, and W Wen. 2002. Documentation for ''Structure'' software: Version 2 (''Documentation for ''Structure''.'')<br> + -Rosenberg NA. 2003. ''Distruct'': a program for the graphical display of population structure (''Documentation for ''Distruct''.'')<br> + -Pritchard JK, M Stephens, P Donnelly. 2000. Inference of Population Structure using multilocus genotype data. Genetics 155:945-959 (''This is the original paper describing the method used by ''Structure''; if you want the mathematical nitty-gritty this is where it's at. It's relatively accessible for such a math-y paper.'')<br> + -Falush D, M Stephens, JK Pritchard. 2003.
Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567-1587 (''This paper describes ''Structure'' version 2.0, which allows for correlations in allele frequencies between closely related populations as well as linkage.'')<br> + -Rosenberg NA, et al. 2001. Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics 159: 699-713 (''This is one of the early applications of the method using microsatellites for chicken breeds. It's a good example of using the method for complex datasets where K is large.'')<br> + <br> + <br> + <br> + [[Comments]]</span> </td> </tr> </table> </div> Structurehttp://bodegaphylo.wikispot.org/Structure2008-03-25 11:44:01BobThomsonUpload of image <a href="http://bodegaphylo.wikispot.org/Structure?action=Files&do=view&target=K_4.png">K_4.png</a>.Structurehttp://bodegaphylo.wikispot.org/Structure2008-03-25 11:43:39BobThomsonUpload of image <a href="http://bodegaphylo.wikispot.org/Structure?action=Files&do=view&target=K_3.png">K_3.png</a>.Structurehttp://bodegaphylo.wikispot.org/Structure2008-03-25 11:40:06BobThomsonUpload of image <a href="http://bodegaphylo.wikispot.org/Structure?action=Files&do=view&target=distruct_output.png">distruct_output.png</a>.Structurehttp://bodegaphylo.wikispot.org/Structure2008-03-25 11:13:20BobThomsonUpload of file <a href="http://bodegaphylo.wikispot.org/Structure?action=Files&do=view&target=marm_struct.input">marm_struct.input</a>.