SynMap: Difference between revisions

From CoGepedia
Aicasti1 (talk | contribs)
 
(88 intermediate revisions by 6 users not shown)
Line 1: Line 1:
[[Image:Master 4241 4243.CDS-CDS.blastn geneorder D20 g10 A5.w500.png|thumb|right|500px|Syntenic dotplot generated by SynMap between two substrains of Escherichia coli K12, DH10B and W3110.]]
[[Image:Synmap1.png|thumb|right|500px|Syntenic dotplot generated by SynMap between two substrains of Escherichia coli K12, DH10B and W3110. Results can be regenerated at https://genomevolution.org/r/dfjw]]


==Overview==
Line 10: Line 10:
If you choose, [[synonymous]] and non-synonymous site mutation data can be calculated for protein coding genes that are identified as syntenic.  These genes will then be colored based on those values in the dotplot for rapid identification of different age-classes of syntenic regions.


==Specifying genomes==
==SynMap Methods==
[[Image:SynMap select organisms.png|thumb|right|800px|Selecting organisms in SynMap is done under the "select organisms" tab.]]
#Extract sequences for comparison; build fasta files
It is easy to search and select organisms in SynMap. Just select the "select organisms" tab and search for two organisms by either name or [[organism description]].  Just type part of either in the text box and SynMap automatically searches for any organism that contains your text. The results are displayed below the search boxes.  Some organisms may have multiple versions of their genome and different types of sequences (e.g. masked versus unmasked). These will be displayed in a drop-down menu from which you can select the correct genome. Also, this is where you can select for comparing [[CDS]] sequence or genomic sequence. If the genome does not have [[CDS]] features, the option won't be displayed and a warning will be printed below these drop-down menus.  When selected, a brief description of the genome will be displayed below the drop-down menus.  This will include the full organism name and description, followed by an overview of the genome.  If you click on "Genome Information:" it will link to [[OrganismView]] and give you a full description of the genome.  Otherwise, it will display the source of the genome, the number of chromosomes in the genome, the total length of the genome in nucleotides, and whether the genome contains plasmids or [[contigs]].
#Create blastable databases (if necessary) and compare using
##Last, a variant of Blast, much faster than lastz, nearly as sensitive.  Last Homepage: http://last.cbrc.jp/
##[http://www.bx.psu.edu/miller_lab/ lastz], a variant of blastz <ref name="blastz">[http://genome.cshlp.org/content/13/1/103 Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res. 13, 103−107 (2003)]</ref>
##MegaBlast
##Discontinuous MegaBlast
##TblastX
##BlastP
##BlastN <ref name="blast">Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215:403-410 </ref>. SynMap uses default blast parameters with an e-value cutoff of 0.001.
#Identifies [[tandem gene duplicates]] using a program written by Haibao Tang and Brent Pedersen called blast2raw. Tandem gene duplicates are then condensed and treated as a single gene. These can later be output and downloaded in a file to be investigated further.
#Filters repetitive matches, e.g. high copy number genes, using a program written for SynMap by Brent Pedersen
#Identify syntenic pairs by finding collinear series of putative homologous sequences using DAGChainer <ref name="dagchainer">Haas BJ, Delcher AL, Wortman JR, Salzberg SL (2004) DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20: 3643–3646</ref>
#Optional: calculate synonymous and nonsynonymous mutation rates for syntenic gene pairs using CodeML of the PAML package<ref name="paml">Yang Z (2007) PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution 24:1586-1591</ref>
#Generate dotplot of all putative homologous matches; dots are colored gray
#Add syntenic pairs; dots are colored either green or red (same or opposite orientation), or based on the synonymous mutations, nonsynonymous mutations, or the ratio of nonsynonymous/synonymous mutations


==DAGChainer options==
{{reflist}}
[[Image:SynMap DagChainer Options.png|thumb|right|600px|SynMap's panel for specifying DAGChainer options.]]
 
[http://dagchainer.sourceforge.net/ DAGChainer] is the algorithm used for finding series of collinear genes (or regions of sequence similarity) between two genomes in order to infer [[synteny]].   To specify DAGChainer options, select the "DAGChainer Options" tab.
== Specifying genomes  ==
 
[[Image:SynMapForm1.png|thumb|right|800px]] It is easy to search and select organisms in SynMap. Just select the "select organisms" tab and search for two organisms by either name or [[Organism description]]. Just type part of either in the text box and SynMap automatically searches for any organism that contains your text. The results are displayed below the search boxes. Some organisms may have multiple versions of their genome and different types of sequences (e.g. masked versus unmasked). These will be displayed in a drop-down menu from which you can select the correct genome. Also, this is where you can select for comparing [[CDS]] sequence or genomic sequence. If the genome does not have [[CDS]] features, the option won't be displayed and a warning will be printed below these drop-down menus. When selected, a brief description of the genome will be displayed below the drop-down menus. This will include the full organism name and description, followed by an overview of the genome. If you click on "Genome Information:" it will link to [[OrganismView]] and give you a full description of the genome. Otherwise, it will display the source of the genome, the number of chromosomes in the genome, the total length of the genome in nucleotides, and whether the genome contains plasmids or [[Contigs]].
 
==QuotaAlign options==
[[QuotaAlign]] is a post-processing step to merge adjacent syntenic blocks, or to select a subset of syntenic blocks that reflects the expected matching ratio of regions (for example, the number of subgenomes in duplicated genomes). These two steps are essential for downstream analysis of genome rearrangements. Read more details on [[QuotaAlign]].
 
*NOTE: If you select a Quota Ratio of 1:1, a link will appear in the "Links and Downloads" section of the results called "Rearrangement Analysis (powered by GRIMM!)".  This link will take you to the [http://grimm.ucsd.edu/GRIMM/ GRIMM: Genome Rearrangement Analysis in Man and Mouse server] and automatically populate their submission box with your genomes from SynMap, appropriately coded for rearrangement analysis.  Please note that their algorithm will work on any pair of genomes with a 1:1 syntenic mapping.
**Tutorial on [[Genomic Rearrangement Analysis]] using SynMap and [http://grimm.ucsd.edu/GRIMM/ GRIMM].
 
==SynMap Options==
===Analysis Options===
*'''Blast Algorithm''':
**(B)LastZ:  Nucleotide-nucleotide search.  This is the LastZ implementation that has been parallelized to break up the query sequences into multiple pieces for searching.  '''This is usually the best algorithm''' to pick in terms of sensitivity and speed.
**MegaBlast:  Nucleotide-nucleotide search.
**Discontinuous Megablast:  Nucleotide-nucleotide search
**BlastN: Nucleotide-Nucleotide search
**BlastP:  Protein-Protein search
**TblastX: Translated Nucleotide-Translated Nucleotide search
***tblastx takes 6 times longer than blastn and usually doesn't improve the ability to find synteny.  If the DNA sequence is so diverged at the nucleotide level that a protein sequence search is needed to find putative homologous sequences, the genome structure is usually also very divergent and collinear gene sets are not likely to be found.
*'''Filter Repetitive Matches''':  This option adjusts the e-values of the blast hits to lower the significance of sequences that occur multiple times in a genome.
*'''DAGChainer options''': 
**DAGChainer is the algorithm to identify syntenic regions between genomes.  It works by searching some distance from a pair of genes for another pair.  If some threshold number of gene pairs are identified, DAGChainer keeps that set of gene-pairs which will be reported back to SynMap.  These sets of gene pairs are interpreted as two syntenic regions, and get colored in the dotplot for easy identification.
**"Relative Gene Order" and "Nucleotide Distance".
***This determines whether DAGChainer is searching by number of genes (or regions of sequence similarity) or by absolute genomic nucleotide distances.  For the most part, using "relative gene order" is preferable.  The absolute distance between genes in nucleotides varies widely between genomes, and even within a genome.  For the former, genes in a bacterial genome are spaced orders of magnitude closer together than genes in an animal genome.  For the latter, genes near centromeres are often further from one another than genes distal to a centromere.  This is very apparent in plant genomes.
**'''Average distance expected between syntenic genes'''
**'''Maximum distance between two matches'''
***These two options are self explanatory.  One thing to keep in mind is that the larger these values are set, the more generous DAGChainer is in including genes in a collinear set.  In other words, as these values are increased, your false positive rate goes up.  Also, the "average distance" should be lower than "maximum distance".
**'''Minimum number of aligned pairs'''
***This is the minimum number of gene-pairs that DAGChainer needs in a collinear gene set to keep.  The higher this number, the more stringent DAGChainer will be for finding "good" collinear gene sets.
*'''Merge Syntenic Blocks''':  Merges neighboring syntenic blocks into larger blocks.  Useful if you are performing statistics on syntenic blocks.
**'''Quota Align Merge''': Use Haibao Tang's Quota Align method to merge syntenic blocks (recommended).  Merge distance is the number of genes or nucleotides to use to link a neighboring syntenic region
**'''Iterative DAGChainer''':  Recode the blocks into a new input file for DAGChainer and run DAGChainer again
*'''[[Syntenic Depth]] with [[Quota align]]''':
**This algorithm removes syntenic regions by enforcing a defined syntenic relationship between genomes.  For example, a 1:1 [[syntenic depth]] will find and keep the best syntenic regions that cover each genome once.  A 1:2 [[syntenic depth]] will find the best syntenic regions that cover one genome once and the other genome twice.  This option is useful if you want to minimize noise in an analysis with false positive syntenic regions, are testing for multiple polyploidy events, or are trying to mask older polyploidy events from an analysis.
**Overlap Distance:  How much syntenic regions may overlap and still be permitted.  Some overlap is recommended, as the algorithms for finding syntenic regions may produce fuzzy ends that overlap with one another.
**Tutorial on [[Genomic Rearrangement Analysis]] using SynMap and [http://grimm.ucsd.edu/GRIMM/ GRIMM].
*'''[[FractBias]]''':
**This algorithm calculates and plots fractionation bias between the selected genomes. In order to run, Syntenic Depth must be set to specify which genome is used as the reference (syntenic depth 1) and which genome will be assessed for fractionation bias (syntenic depth >1).
**For details on this tool, please see: [[FractBias]]
*CodeML:  Turn on this option to calculate synonymous and non-synonymous mutation rates for identified syntenic gene pairs.  The syntenic gene pairs get colored according to these values and a histogram with those color-metrics is displayed below the syntenic dotplot.
 
====Advanced Options====
*Tandem duplication distance:  This is the distance searched (in genes) when identifying tandem gene duplicates.  A value of 10 means that if there are genes within 10 genes of one another that both match the same gene in the other genome, they will be considered tandem duplicates and condensed to a single representative.  This option is used by blast_to_raw.py
*C-score:  This option filters low quality blast hits using a cutoff value between 0 and 1.  From the documentation of blast_to_raw.py:
see supplementary info for sea anemone genome paper <http://www.sciencemag.org/cgi/content/abstract/317/5834/86>
formula below:
  cscore(A,B) = score(A,B) / max(best score for A, best score for B)
**If a c-score value of 0.5 is used, this means that any hit to a given gene will be filtered if its score is less than 50% of that gene's best score.
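
The C-score formula above can be illustrated with a minimal Python sketch. This is not the code from blast_to_raw.py; the function name <code>filter_by_cscore</code>, the tuple layout of a hit, and the example gene names are assumptions made for illustration, and scores are assumed to be positive.

```python
from collections import defaultdict


def filter_by_cscore(hits, cutoff=0.5):
    """Keep hits whose c-score is at least `cutoff`.

    hits: list of (query_gene, subject_gene, score) tuples (hypothetical layout).
    cscore(A,B) = score(A,B) / max(best score for A, best score for B)
    """
    # Best score seen for each query gene and each subject gene.
    best = defaultdict(float)
    for query, subject, score in hits:
        best[("query", query)] = max(best[("query", query)], score)
        best[("subject", subject)] = max(best[("subject", subject)], score)

    kept = []
    for query, subject, score in hits:
        cscore = score / max(best[("query", query)], best[("subject", subject)])
        if cscore >= cutoff:
            kept.append((query, subject, score))
    return kept


# Hypothetical hits: the second hit scores 90, but geneA1's best score is 200,
# so its c-score is 90 / 200 = 0.45 and it is filtered at a cutoff of 0.5.
hits = [
    ("geneA1", "geneB1", 200.0),
    ("geneA1", "geneB2", 90.0),
    ("geneA2", "geneB2", 100.0),
]
kept = filter_by_cscore(hits, cutoff=0.5)
```
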


==SynMap options==
===Display Options===
*'''Regenerate dotplot images''':  Select this if you want the dotplot image regenerated.  Images are saved and can be out of date with regard to the methods used to generate them, or an error may have occurred during their original generation.
*'''Show non-syntenic matches (grey dots)''': Select this option if you wish to see all the putative homologous matches on the dotplot drawn as grey dots.  Otherwise, only syntenic matches are shown.  This option may increase the time it takes to generate the dotplot as the additional dots are identified and drawn.
*'''Draw boxes around syntenic regions? ''':  This option draws boxes around identified syntenic blocks.  Using this option is useful when determining whether to merge neighboring syntenic blocks and how much overlap neighboring blocks may have during merging.
*'''Sort Chromosomes by''': Whether chromosomes or contigs on the axes of the syntenic dotplot are ordered by size or by name.
*'''Flip axes''': Do you want to exchange the x and y axes?
*'''Color diagonals by''': By default, syntenic dots (diagonals) are colored green.  You can select to color them by:
**Inversion:  positive sloped lines are green, negative sloped lines are blue
**Diagonals:  Each syntenic block gets a different color (actually, cycles through 6 colors).  This is a method for visualizing the identified syntenic blocks.  Try this and the "Draw boxes around syntenic regions".
*'''Dotplot axis metric:'''  Are the dotplot axes to be measured in nucleotide or gene units?  Nucleotides accurately reflect the structure of the genome.  Gene units help make collinear lines straighter if there are regions of the genome undergoing different rates of expansion (e.g. centromeres). 
*'''Master image width:'''  How wide, in pixels, will the master dotplot image be?  Dynamic scales it according to the size of the genome, but sometimes it is necessary to make them very large to see some details.
*'''Calculate substitution rates for syntenic protein coding gene pairs and color syntenic dots accordingly'''
**This is the option to set if you want to calculate synonymous rates, nonsynonymous rates, or the ratio of the two, and color syntenic gene pair dots based on those values.  This is very helpful if there are whole genome duplication events in one or both genomes.  When these values are generated, a histogram of the values will also be generated and displayed.  The histogram will be color coded identically to the dots in the dotplot.  This makes it easy to determine which syntenic gene pair dots are older or newer than other syntenic dots. For an example, please [[SynMap#Example_Results | see below]].  PLEASE NOTE:  generating these values takes a long time.  It takes a very long time for large genomes.
*'''Order contigs by best syntenic path:'''  This option will invoke a special algorithm to rearrange contigs in order to create the best order of contigs that makes a continuous syntenic path in relation to a reference genome.  This is very useful for genome sequences that have been assembled to contigs and for which a reference genome is available.  PLEASE NOTE: this ordering of contigs may not be how the genome is actually structured.  Some genomic changes, such as large-scale [[inversions]], will not be correctly placed if the contigs' break-points are at the ends of the inversion.  This is often the case, as inversions happen in repetitive sequences, and these same repetitive sequences are often where genome assembly algorithms cannot assemble through.  For an example, see [[Syntenic path assembly]].


==Calculating and displaying synonymous/non-synonymous (Ks, Kn) data==
===This option is selected under the "SynMap Options" tab===
'''This option is selected under the "SynMap Options" tab'''
To select, just set the option "Calculate substitution rates for syntenic protein coding gene pairs and color syntenic dots accordingly" to what rate you desire:
*Ks -- synonymous
*Ks -- [[synonymous]] mutation rate
*Kn -- non-synonymous
*Kn -- [https://genomevolution.org/wiki/index.php/Synonymous/Nonsynonymous_Mutations_%28Ks/Ka%29 non-synonymous] mutation rate (also known as Ka)
*Kn/Ks -- ratio of non-synonymous to synonymous changes.  This is often used to detect neutral (Kn/Ks == 1), positive (Kn/Ks > 1), and purifying (Kn/Ks < 1) selection acting on a pair of genes.
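
As a minimal sketch of how the Kn/Ks ratio is read, the Python function below labels a gene pair from its two rates. The function name <code>classify_selection</code> and the <code>tolerance</code> band around 1 are assumptions made for illustration and are not part of SynMap; real analyses would normally use a statistical test rather than a fixed band.

```python
def classify_selection(kn, ks, tolerance=0.05):
    """Label a gene pair from its Kn and Ks values.

    `tolerance` is a hypothetical band around Kn/Ks == 1 that is treated
    as neutral; it is not a SynMap parameter.
    """
    if ks == 0:
        # The ratio is undefined when there are no synonymous changes.
        return "undefined"
    ratio = kn / ks
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "positive" if ratio > 1.0 else "purifying"


# Kn = 0.0258 and Ks = 0.1334 give a ratio of about 0.19.
label = classify_selection(0.0258, 0.1334)
```
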


Line 33: Line 107:


==Running SynMap==
Just press "Generate SynMap" when everything is configured.  While SynMap runs, messages will be displayed as to the stage of the analysis that is being processed.


==Interacting with results==
'''Main results'''
#Cross-hairs help visually align syntenic regions from different parts of the genome.
#Clicking on a chromosome panel in the initial results creates a zoomed-in panel.
'''Zoomed-in results'''
#Cross-hairs turn red when the mouse is over a gene pair that is a link to [[GEvo]] with that pair pre-loaded.  [[GEvo]] can then be used to perform a high-resolution analysis of those genomic regions.


==Example Results==
==Output Files==
[[Image:PastedGraphic-1-.png|thumb|center|200px|Diagram showing ancestral WGD events, alpha and beta, before lineage divergence in ''Arabidopsis'']]
SynMap provides links to all the output files generated during its analysis.  Most of the files are located in a box below the dotplot named "Links and Downloads".
 
===Log File===
*'''Analysis Log (id: SynMap_52205333)''': 
**This file contains a list of every computational action taken throughout SynMap's analytical pipeline.  This includes the exact command line used to run each program.  This file is very useful for reproducing an analysis yourself, and for figuring out errors in the pipeline should they occur.
 
####################
Runing genome comparison
Running (B)lastZ
running:
/usr/bin/python /opt/apache/CoGe/bin/blastz_wrapper/blastz.py -A 32 --path=/usr/bin/lastz -i /opt/apache/CoGe/data/fasta//7118-CDS.fasta -d /opt/apache/CoGe/data/fasta//4243-CDS.fasta -o /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz
blastfile /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz already exists
Completed blast run(s)
####################
####################
Creating .bed files
Creating bed files: /opt/apache/CoGe/bin/SynMap/blast2bed.pl -infile /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz -outfile1 /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz.q.bed -outfile2 /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz.s.bed
####################
 
===Homolog search===
These are the data files that are used and processed to identify putative homologous genes between the selected genomes
 
*'''Fasta file for Escherichia coli K12 strain K-12 substrain DH10B (v1): CDS'''
*'''Fasta file for Escherichia coli K12 strain K-12 substrain MG1655 (v2): CDS'''
**These are the fasta files used in the whole genome comparison.  May be CDS or genomic sequences.
*'''Unfiltered (B)lastZ results'''
**This is the output from the whole genome comparison.  The output will depend on the sequence comparison algorithm selected. This output is in tab-delimited format and is the raw blast output. The headers of the columns for the output file are: query ID(chr|start|stop|name|strand|feat type|database id|gene order)<tab>subject id (chr|start|stop|name|strand|feat type|database id|gene order)<tab>Percent identity<tab>Alignment Length<tab>Mismatch count<tab>gap open count<tab>query start<tab>query end<tab>Subject start<tab>Subject end<tab>eValue<tab>Bit Score.
*'''Filtered (B)lastZ results (no tandem duplicates)'''
*'''Tandem Duplicates for Escherichia coli K12 strain K-12 substrain DH10B (v1)'''
*'''Tandem Duplicates for Escherichia coli K12 strain K-12 substrain MG1655 (v2)'''
**Tandem duplicates are removed from the whole genome comparison.  The filtered blast file is sent to DAGChainer, and the tandem duplicates are saved for each genome.  Each line of the tandem duplicate file contains one set of tandem duplicates and links to CoGe to extract those sequences.  The format for the tandem duplicate files is as follows:
 
#FeatList_link  GEvo_link      FastaView_link  chr||start||stop||name||strand||type||database_id||gene_order
http://genomevolution.org/CoGe/FeatList.pl?fid=41633483;fid=41633480;  http://genomevolution.org/CoGe/GEvo.pl?fid1=41633483;fid2=41633480;num_seqs=2  http://genomevolution.org/CoGe/FastaView.pl?fid=41633483;fid=41633480;  1||1009879||1010469||E4781_964||-1||CDS||41633480||974  1||1010569||1010712||E4781_965||-1||CDS||41633483||975


**FeatList_link:  Generate a feature list of the tandem duplicates (this tool links to many other tools in CoGe for analyzing and processing lists of genomic features)
**GEvo_link:  Compare the tandem duplicates and their genomic region
**FastaView_link: Generate fasta sequences of the tandem duplicates
**Each tandem duplicate then follows these links and contains the following information delimited by "||"
***Chromosome
***Start position
***Stop position
***Gene name
***Strand
***Genomic feature type (CDS, tRNA, rRNA)
***CoGe database id (useful for constructing links to CoGe's various tools)
***Order within the genome


===Diagonals===
These are the files that will be sent into DAGChainer to identify syntenic regions and gene pairs.  Different files will be present depending on the options used in the analysis:
*'''DAGChainer Initial Input file'''
**The whole genome comparison file (post tandem gene removal) converted into the input type for DAGChainer
*'''DAGChainer Input file converted to gene order'''
**If relative gene order is used to specify gene positions within a genome, this file will be generated.  The chromosomal start/stop positions will be replaced with the gene position along a chromosome.  For example if the position is 1123, that feature is the 1123rd feature along that chromosome.
*'''DAGChainer Input file post repetitive matches filter'''
**This file has the e-values adjusted such that genomic features which match multiple regions in the genome get higher (less significant) e-values.  This is the "Filter repetitive matches" in the "Analysis Options" tab.
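
The conversion to gene order described for the "DAGChainer Input file converted to gene order" can be sketched in Python as follows. This is an illustration under stated assumptions rather than SynMap's own code: the function name <code>assign_gene_order</code>, the tuple layout, and the feature names are hypothetical, and features are ranked by their lowest coordinate on each chromosome.

```python
from collections import defaultdict


def assign_gene_order(features):
    """Return {feature_name: 1-based position along its chromosome}.

    features: list of (chromosome, start, stop, name) tuples (hypothetical
    layout). Names are assumed to be unique across the genome.
    """
    by_chromosome = defaultdict(list)
    for chromosome, start, stop, name in features:
        # Use the lower coordinate so strand does not affect the ranking.
        by_chromosome[chromosome].append((min(start, stop), name))

    order = {}
    for genes in by_chromosome.values():
        for rank, (_, name) in enumerate(sorted(genes), start=1):
            order[name] = rank
    return order


# Hypothetical features: numbering restarts on each chromosome.
features = [
    ("1", 500, 900, "g2"),
    ("1", 100, 300, "g1"),
    ("2", 50, 80, "h1"),
]
gene_order = assign_gene_order(features)
```
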


[[Image:Master_3068_8.CDS-CDS.blastn_geneorder_D20_g10_A5.w2000.ks.png|thumb|right|600px|Figure 1a: Syntenic dotplot between ''Arabidopsis thaliana'' and ''Arabidopsis lyrata''.  Syntenic gene pairs identified by DAGChainer have been colored based on their synonymous rate change as calculated by CODEML.  Results can be regenerated [http://synteny.cnr.berkeley.edu/CoGe/SynMap.pl?dsgid1=3068;dsgid2=8;D=20;g=10;A=5;w=0;b=1;ft1=1;ft2=1;dt=geneorder;ks=1;autogo=1 here]. ]]
Example DAGChainer input line:


[[Image:Master 3068 8.CDS-CDS.blastn geneorder D20 g10 A5.w2000.ks.hist-1.png|thumb|right|600px|Figure 1b: Histogram of synonymous rate change data for syntenic gene pairs between Arabidopsis thaliana and Arabidopsis lyrata.  The two obvious peaks in the distribution are from syntenic gene pairs (syntelogs) derived from the speciation of these two taxa and from their shared most recent whole genome duplication event, known as alpha.]]
a7118_1 1||2588568||2593148||E4781_2508||-1||CDS||41638112||2528||100.0 2528    2528    b4243_1 1||2776802||2781382||AP_003226.1||-1||CDS||24957549||2612||100.0        2612    2612    0


[[Image:SynPlot-Chr1-At-v-Al.png|thumb|right|600px|Figure 2: Syntenic dotplot and synonymous substitution histogram of Chromosome 1 of Arabidopsis thaliana and Arabidopsis lyrata. Syntenic gene pairs were identified by DAGChainer, and colored based on their synonymous substitution rate as calculated by CodeML. Syntenic regions derived from speciation (orthologs) or from the shared whole genome duplication events (alpha and beta) are labeled. These images were generated by SynMap and can be viewed at http://toxic.berkeley.edu/CoGe//diags/Arabidopsis_lyrata/Arabidopsis_thaliana_thale_cress/html/3068_8.CDS-CDS.blastn.dag_geneorder_D20_g10_A5.scaffold_1-1.w600.ks.html]]
*<org a chromosome>: This is coded with an "a" or "b", followed by the CoGe database id for the genome, followed by an "_", followed by the chromosome name. This creates a unique "name" for each chromosome in DAGChainer even if the same genome is compared against itself.
*<org a gene name>:  This is a "||" delimited format that contains the following information:
**Chromosome
**Start position
**Stop position
**Gene name
**Strand
**Genomic feature type (CDS, tRNA, rRNA)
**CoGe database id (useful for constructing links to CoGe's various tools)
**Order within the genome
**Blast hit identity of pair
*<org a gene start position>
*<org a gene stop position>:  Note: will be same as start if relative gene order on chromosome is used (as shown in the example above)
*<org b chromosome>
*<org b gene name>
*<org b gene start position>
*<org b gene stop position>
*<blast e-value>
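
The nine columns listed above can be read with a short Python sketch. The function names and dictionary keys are assumptions made for illustration and are not part of SynMap or DAGChainer. The sketch splits on any whitespace, which assumes that gene names contain no spaces.

```python
# Keys for the "||"-delimited gene name field, in the order listed above.
FEATURE_KEYS = (
    "chromosome", "start", "stop", "name", "strand",
    "feature_type", "db_id", "gene_order", "percent_identity",
)


def parse_feature(field):
    """Split one "||"-delimited gene name field into a dict of strings."""
    parts = field.split("||")
    if len(parts) != len(FEATURE_KEYS):
        raise ValueError(f"expected {len(FEATURE_KEYS)} '||' fields, got {len(parts)}")
    return dict(zip(FEATURE_KEYS, parts))


def parse_dag_input_line(line):
    """Parse one DAGChainer input line into a dict."""
    cols = line.split()
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    return {
        "a_chr": cols[0],
        "a_feature": parse_feature(cols[1]),
        "a_start": int(cols[2]),
        "a_stop": int(cols[3]),
        "b_chr": cols[4],
        "b_feature": parse_feature(cols[5]),
        "b_start": int(cols[6]),
        "b_stop": int(cols[7]),
        "evalue": float(cols[8]),
    }


# The example input line shown above.
example = (
    "a7118_1\t1||2588568||2593148||E4781_2508||-1||CDS||41638112||2528||100.0\t2528\t2528\t"
    "b4243_1\t1||2776802||2781382||AP_003226.1||-1||CDS||24957549||2612||100.0\t2612\t2612\t0"
)
record = parse_dag_input_line(example)
```
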


[[Image:GEvo-At-Al.png|thumb|right|600px|Figure 3: GEvo analysis of syntenic regions from Arabidopsis thaliana and Arabidopsis lyrata.  At1-Al1 and At2-Al2 are derived from the speciation of the taxa.  This is evidenced by the strong spatial syntenic signal as seen by the extensive collinear arrangement of orthologous gene pairs (pink and blue lines). These pairs of syntenic regions are also syntenic with respect to one another, and are derived from the most recent whole genome duplication event (WGD) shared by both taxa. This event is known as "alpha".  Green lines connecting the out-paralogous gene pairs are shown between Al1 and At2. While also collinear, there is a lower density of syntenic gene pairs than between either At1-Al1 or At2-Al2.  Al3 is also syntenic to these regions, but is derived from a prior whole genome duplication event known as beta.  Dark blue lines have been drawn connecting syntenic gene pairs between Al3 and Al2, and show the lowest density of all.  Results can be regenerated at http://tinyurl.com/lvpzln .]]
===Results===
*'''DAGChainer output'''
**Output generated by DAGChainer.
#1      133556.0        a4241_1 b7117_1 f      2673
a4241_1 1||1397989||1398285||ECDH10B_1366||-1||CDS||24909233||1293||99.7        1293    1293    b7117_1 1||1290089||1290385||E4401_1219||-1||CDS||41621513||1231||99.7  1231    1231    1.000000e-250  50
a4241_1 1||1398509||1399228||ECDH10B_1367||1||CDS||24909270||1294||100.0        1294    1294    b7117_1 1||1290594||1291328||E4401_1220||1||CDS||41621516||1232||100.0 1232    1232    1.000000e-250  100
**DAGChainer output files contain two types of data lines:
***"#" lines separate identified syntenic blocks.  The information contained in this line is:
****#<syntenic block number> (NOTE: this is not a unique number:  each chromosome-chromosome (contig-contig) comparison has this number reset)
****Score for syntenic block
****Organism A chromosome in syntenic block  
****Organism B chromosome in syntenic block
****Orientation of syntenic block:  f == same orientation, positive slope on dotplot; r == opposite orientation, negative slope on dotplot  
****Number of syntenic gene pairs identified in syntenic block
***All other lines: contain a syntenic match
****Org A Chromosome
****Gene/region in org A
****Start position in org A (may be relative gene order or absolute nucleotide position)
****Stop position in org A (may be relative gene order or absolute nucleotide position)
****Org B Chromosome
****Gene/region in org B
****Start position in org B (may be relative gene order or absolute nucleotide position)
****Stop position in org B (may be relative gene order or absolute nucleotide position)
****Evalue of match
****Accumulative score of diagonal including this gene match (number increases with additional syntenic gene pairs within a syntenic block)
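
The two line types described above can be grouped into syntenic blocks with a short Python sketch. It is an illustration only: the function name <code>parse_dag_output</code> and the dictionary keys are assumptions, and it expects the plain DAGChainer output shown above, not the variant with Ks/Kn columns or free-text comment lines.

```python
def parse_dag_output(lines):
    """Group DAGChainer output lines into a list of syntenic blocks."""
    blocks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        cols = line.split()
        if line.startswith("#"):
            # Block header: number, score, chromosomes, orientation, pair count.
            blocks.append({
                "number": int(cols[0].lstrip("#")),
                "score": float(cols[1]),
                "chr_a": cols[2],
                "chr_b": cols[3],
                "orientation": cols[4],  # "f" same orientation, "r" opposite
                "n_pairs": int(cols[5]),
                "pairs": [],
            })
        elif blocks:
            # Syntenic match: ten columns, the last is the cumulative score.
            blocks[-1]["pairs"].append({
                "gene_a": cols[1],
                "start_a": int(cols[2]),
                "stop_a": int(cols[3]),
                "gene_b": cols[5],
                "start_b": int(cols[6]),
                "stop_b": int(cols[7]),
                "evalue": float(cols[8]),
                "cumulative_score": float(cols[9]),
            })
    return blocks


# The example output lines shown above.
example_lines = [
    "#1\t133556.0\ta4241_1\tb7117_1\tf\t2673",
    "a4241_1\t1||1397989||1398285||ECDH10B_1366||-1||CDS||24909233||1293||99.7\t1293\t1293\t"
    "b7117_1\t1||1290089||1290385||E4401_1219||-1||CDS||41621513||1231||99.7\t1231\t1231\t1.000000e-250\t50",
    "a4241_1\t1||1398509||1399228||ECDH10B_1367||1||CDS||24909270||1294||100.0\t1294\t1294\t"
    "b7117_1\t1||1290594||1291328||E4401_1220||1||CDS||41621516||1232||100.0\t1232\t1232\t1.000000e-250\t100",
]
blocks = parse_dag_output(example_lines)
```

Note that the header's pair count (2673 here) describes the full block in the original file, so it will not match the number of pairs parsed from a truncated excerpt like this one.
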


Please refer to the appropriate images for this discussion, or you can regenerate this analysis [http://synteny.cnr.berkeley.edu/CoGe/SynMap.pl?dsgid1=3068;dsgid2=8;D=20;g=10;A=5;w=0;b=1;ft1=1;ft2=1;dt=geneorder;ks=1;autogo=1 here.] This example shows a whole genome syntenic dotplot comparison of ''Arabidopsis thaliana'' (At) and ''Arabidopsis lyrata'' (Al).  These taxa diverged from one another ~5MYA <ref name="lysak2006">Lysak MA, Berr A, Pecinka A, Schmidt R, McBreen K, Schubert I. Mechanisms of chromosome number reduction in Arabidopsis thaliana and related Brassicaceae species. Proceedings of the National Academy of Sciences of the United States of America. 2006 Mar 28;103(13):5224-5229.
*'''Merged DAGChainer output'''
</ref> and share two sequential whole genome duplication events <ref name=bowers2003>Bowers JE, Chapman BA, Rong JK, Paterson AH. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature. 2003;422:433–438.</ref> since the divergence of their lineage with ''Carica papaya'''s lineage <ref name=ming2008>Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KL, et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature. 2008;452:991–996. doi: 10.1038/nature06856.</ref>.  Each whole genome duplication event creates a contemporaneous copy of every chromosome and all the genetic information they contain.  However, over evolutionary time, these duplicated [[homeologous]] chromosomes are [[fractionated]], undergo [[rearrangement]] and [[inversion]]s, gene [[transposition]] events, and other genomic changes.  In addition, duplicated genes that are retained (as well as surrounding non-coding sequence) will diverge from one another.  Coding sequence divergence can be measured by synonymous changes (Ks), and a population of contemporaneously created syntenic gene pairs from a whole genome duplication event will create characteristic peaks in a histogram of Ks values <ref name=blanc2004>Blanc, G., and K. H. Wolfe. 2004. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16:1667-1678</ref>.
**This file is generated if the analysis option is selected for merging neighboring syntenic regions
*'''Quota Alignment output'''
**This file is generated if the analysis option for [[Quota Align]] is selected for enforcing a specific syntenic depth relationship between the analyzed genomes
*'''DAGChainer output in genomic coordinates'''
**This file is the same as the DAGChainer output file but the genomic coordinates have been converted from gene order to nucleotide positions. 
*'''Results with synonymous/nonsynonymous rate values'''
**This file will be generated if synonymous or non-synonymous rates were calculated for syntenic gene pairs.  This file is identical to the DAGChainer output file but also contains synonymous and non-synonymous rates in the first two positions of each syntenic gene pair line. Each line also ends with a link to [[GEvo]] for comparing the genomic regions in which the syntenic gene pair resides.
#This file contains synonymous rate values in the first two columns:
#1      133556.0        a4241_1 b7117_1 f      2673  Mean Ks:  0.1334  Mean Kn: 0.0258
#Ks    Kn      a<db_dataset_group_id>_<chr>   chr1||start1||stop1||name1||strand1||type1||db_feature_id1||percent_id1 start1  stop1 b<db_dataset_group_id>_<chr>   chr2||start2||stop2||name2||strand2||type2||db_feature_id2||percent_id2 start2  stop2  eval    ???    GEVO_link
0.0093  0.0000  a4241_1 1||1397989||1398285||ECDH10B_1366||-1||CDS||24909233||1293||99.7        1398285 1397989 b7117_1 1||1290089||1290385||E4401_1219||-1||CDS||41621513||1231||99.7  1290385 1290089 1.000000e-250  50 http://genomevolution.org/CoGe/GEvo.pl?fid1=24909233;fid2=41621513;dsgid1=4241;dsgid2=7117
  0.0000 0.0000 a4241_1 1||1398509||1399228||ECDH10B_1367||1||CDS||24909270||1294||100.0        1398509 1399228 b7117_1 1||1290594||1291328||E4401_1220||1||CDS||41621516||1232||100.0 1290594 1291328 1.000000e-250  100 http://genomevolution.org/CoGe/GEvo.pl?fid1=24909270;fid2=41621516;dsgid1=4241;dsgid2=7117


Shared whole genome duplication events can be detected through syntenic dotplot analysis (spatial analysis of gene order) and through synonymous change rate (Ks) histograms (temporal analysis of coding sequence divergence) for putative homologous gene pairs.  SynMap can combine these approaches and can identify collinear sets of putatively homologous genes (spatial detection of synteny), calculate Ks values for these syntelogous gene pairs (temporal calculation of synteny), and use those Ks values to generate a color-metric histogram and paint the syntelogs the appropriate color on the dotplot.  This combination of temporal and spatial syntenic analysis creates a final image that permits the rapid visual identification and evaluation of shared whole genome duplication events.
*'''Final syntenic gene-set output with GEvo links'''
**Same as the output file but also contains links to [[GEvo]] at the end of each line for pairs of syntenic genes.  This link will load GEvo with the gene pair loaded into two of the sequence submission boxes so the genomic regions in which they are contained may be examined.


Figure 1a shows a syntenic dotplot between the genomes of At (y-axis) and Al (x-axis) laid on each axis.  These plots are generated by comparing every coding sequence between these taxa using blastn in order to identify putatively homologous gene pairs.  These results are used by DAGChainer to find collinear sets of genes shared between the taxa.  The combined data-set is plotted according to their relative genomic position where each putative homologous gene pair is plotted with a gray dot, and syntenic gene pairs are plotted with a color based on their Ks value.  The comparison of At and Al's genomes shows two significant patterns of synteny.  First, these two genomes have syntenic regions identified by bright-green lines that are derived from the speciation divergence of these two taxa.  Second, there are smaller blocks of yellow-green lines that are derived from their shared whole genome duplication event (WGD) known as alpha<ref name=bowers2003></ref>.  Comparison to the Ks histogram (Figure 1B) shows that the bright-green has smaller Ks values (fewer changes) than the yellow-green line, which is to be expected as their divergence post-dates their shared whole genome duplication event.   
*'''Condensed syntelog file with GEvo links'''
**This file assembles sets of syntenic genes so that if one gene in organism A is syntenic to multiple regions in organism B, those are listed together.  This is useful if there are duplications in one genome (e.g. segmental or whole genome):
***Link to GEvo to analyse the syntenic gene set
***Link to GEvo formatted for a web-browser.  When this link is clicked, GEvo will automatically start running the analysis
***Each gene in the syntenic gene set separated by tabs


Generating a close-up view of the comparison of chromosome one from both taxa (Figure 2) reveals a similar pattern (light-blue orthologs, light-green out-paralogs derived from their shared most recent whole genome duplication event), with additional evidence of the more ancient shared whole genome duplication event known as beta<ref name=bowers2003></ref>. The beta whole genome duplication event is visualized by much smaller identified syntenic regions colored in yellow-orange. These syntenic gene-pairs correlate to a smaller peak in the Ks histogram with a larger mean Ks value than the subsequent whole genome duplication event (alpha) or the orthologs derived from the divergence of these taxa.
  http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857462;dsgid1=4241;fid2=41617910;dsgid2=7117;fid3=41622794;dsgid3=7117;num_seqs=3    <a href=http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857462;dsgid1=4241;fid2=41617910;dsgid2=7117;fid3=41622794;dsgid3=7117;num_seqs=3;autogo=1>AutoGo</a>       ECDH10B_0042    E4401_18        E4401_1646
http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857502;dsgid1=4241;fid2=41617913;dsgid2=7117;fid3=41622797;dsgid3=7117;num_seqs=3    <a href=http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857502;dsgid1=4241;fid2=41617913;dsgid2=7117;fid3=41622797;dsgid3=7117;num_seqs=3;autogo=1>AutoGo</a>        ECDH10B_0043    E4401_19        E4401_1647


In order to validate and assess the types and patterns of change at these genomic loci, high-resolution analysis of these syntenic regions can be performed using GEvo, and selecting the appropriate set of genomic regions using SynMap's interface.  Such an analysis can be seen in figure 3 which compares five syntenic regions from these taxa.  Two pairs of regions, At1-Al1 and At2-Al2, are orthologous and derived from the speciation of these lineages.  This is evidenced by the high degree of spatial evidence for synteny between these regions (pink and blue lines) where nearly every gene in these regions has an orthologous partner in a collinear arrangement.  These two pairs of regions are also syntenic with respect to one another (green lines) and are derived from their shared most recent whole genome duplication event (WGD) known as alpha.  These four regions are syntenic to an additional region, Al3, which is derived from these lineages' shared second most recent whole genome duplication event known as beta.  Syntenic genes are connected between Al2 and Al3 using dark blue lines; note the lower density of syntenic gene pairs than for regions derived from the most recent WGD and the speciation of the lineages.  While not shown in this figure, there is a syntenic region in At to Al3 from the speciation of these taxa, and two additional syntenic regions (one from each lineage) derived from the alpha whole genome duplication.


Please note that the Ks histogram is using log10 transformed Ks values. While many people set an upper cutoff for Ks values (usually at 2), these histograms show all values.  The peak in both of these Ks histograms (Mean Ks ~ 65) at the far right and colored red is the result of mis-called syntenic gene pairs, genes whose alignments are very poor (e.g. due to a frame-shift mutation or pseudogenization), or an error in the Ks calculation. 
[[Image: SynMap Links and Downloads.png|thumb|right|500px]]


==Regenerating an Analysis==
After a SynMap analysis completes, SynMap will return a link that will reload SynMap with the exact configuration of your analysis.  You can save this link in your notes, send it to a collaborator, or publish it.  You can find this link under the "Links and Downloads" section of the results.  It is a green button labeled "Regenerate this analysis: http://genomevolution.org/r/2o29".  To copy this link, just right click on the button and choose "copy link location".  If you click the button instead, the shortened link will be sent to a URL detransmogrifier and expanded to the full-length link.


==Example Results==
===[[Syntenic comparison of Arabidopsis thaliana and Arabidopsis lyrata]]===
===[[Maize_Sorghum_Syntenic_dotplot | Maize and Sorghum]]===
===[[Analysis_of_variations_found_in_genomes_of_Escherichia_coli_strain_K12_DH10B_and_strain_B_REL606_using_SynMap_and_GEvo_analysis | Strains of Escherichia coli K12]]===
===[[X-alignments]] of various bacterial genomes===
===Human-Chimp Whole Genome Comparison (Video)===
<youtube>I-dUsMuIkMg</youtube>


{{reflist}}
==FAQ==
===How do I sort the chromosomes by name instead of by size?===
#Go to the "Display Options" tab
#select "Name" from the menu for "Sort Chromosomes by:"
[[Image:Screen shot 2011-02-26 at 7.29.40 PM.png]]

Latest revision as of 00:05, 27 January 2017

Syntenic dotplot generated by SynMap between two substrains of Escherichia coli K12, DH10B and W3110. Results can be regenerated at https://genomevolution.org/r/dfjw

Overview

SynMap allows you to generate a syntenic dotplot between two organisms and identify syntenic regions. This is done by:

  1. Finding putative genes or regions of homology between two genomes
  2. Identifying collinear sets of genes or regions of sequence similarity to infer synteny
  3. Generating a dotplot of the results and coloring syntenic pairs.

If you choose, synonymous and non-synonymous site mutation data can be calculated for protein coding genes that are identified as syntenic. These genes will then be colored based on those values in the dotplot for rapid identification of different age-classes of syntenic regions.

SynMap Methods

  1. Extract sequences for comparison; build fasta files
  2. Create blastable databases (if necessary) and compare using
    1. Last, a variant of Blast, much faster than lastz, nearly as sensitive. Last Homepage: http://last.cbrc.jp/
    2. lastz, a variant of blastz [1]
    3. MegaBlast
    4. Discontinuous MegaBlast
    5. TblastX
    6. BlastP
    7. BlastN [2]. SynMap uses default blast parameters with an e-value cutoff of 0.001.
  3. Identify tandem gene duplicates using a program written by Haibao Tang and Brent Pedersen called blast2raw. Tandem gene duplicates are then condensed and treated as a single gene. These can later be output and downloaded in a file to be investigated further.
  4. Filter repetitive matches, e.g. high copy number genes, using a program written for SynMap by Brent Pedersen
  5. Identify syntenic pairs by finding collinear series of putative homologous sequences using DAGChainer [3]
  6. Optional: calculate synonymous and nonsynonymous mutation rates for syntenic gene pairs using CodeML of the PAML package[4]
  7. Generate dotplot of all putative homologous matches; dots are colored gray
  8. Add syntenic pairs; dots are colored either green or red (same or opposite orientation), or based on the synonymous mutations, nonsynonymous mutations, or the ratio of nonsynonymous/synonymous mutations
  1. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res. 13, 103−107 (2003)
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215:403-410
  3. Haas BJ, Delcher AL, Wortman JR, Salzberg SL (2004) DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20: 3643–3646
  4. Yang Z (2007) PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution 24:1586-1591

Specifying genomes

It is easy to search and select organisms in SynMap. Just select the "select organisms" tab and search for two organisms by either name or Organism description. Just type part of either in the text box and SynMap automatically searches for any organism that contains your text. The results are displayed below the search boxes. Some organisms may have multiple versions of their genome and different types of sequences (e.g. masked versus unmasked). These will be displayed in a drop-down menu from which you can select the correct genome. Also, this is where you can select for comparing CDS sequence or genomic sequence. If the genome does not have CDS features, the option won't be displayed and a warning will be printed below these drop-down menus. When selected, a brief description of the genome will be displayed below the drop-down menus. This will include the full organism name and description, followed by an overview of the genome. If you click on "Genome Information:" it will link to OrganismView and give you a full description of the genome. Otherwise, it will display the source of the genome, the number of chromosomes in the genome, the total length of the genome in nucleotides, and whether the genome contains plasmids or Contigs.

QuotaAlign options

QuotaAlign is a post-processing step to merge adjacent syntenic blocks, or select a subset of syntenic blocks that reflects the matching ratio of regions (for example, the number of subgenomes in duplicated genomes). These two steps are essential for downstream analysis of genome rearrangements. Read more details on QuotaAlign.

  • NOTE: If you select a Quota Ratio of 1:1, a link will appear in the "Links and Downloads" section of the results called "Rearrangement Analysis (powered by GRIMM!)" This link will take you to the GRIMM: Genome Rearrangement Analysis in Man and Mouse server and automatically populate their submission box with your genomes from SynMap, appropriately coded for rearrangement analysis. Please note that their algorithm will work on any pair of genomes with a 1:1 syntenic mapping.

SynMap Options

Analysis Options

  • Blast Algorithm:
    • (B)LastZ: Nucleotide-nucleotide search. This is the LastZ implementation that has been parallelized to break up the query sequences into multiple pieces for searching. This is usually the best algorithm to pick in terms of sensitivity and speed.
    • MegaBlast: Nucleotide-nucleotide search.
    • Discontinuous Megablast: Nucleotide-nucleotide search
    • BlastN: Nucleotide-Nucleotide search
    • BlastP: Protein-Protein search
    • TblastX: Translated Nucleotide-Translated Nucleotide search
      • tblastx takes 6 times longer than blastn and usually doesn't improve the ability to find synteny. If the DNA sequence is so diverged at the nucleotide level that a protein sequence search is needed to find putative homologous sequence, the genome structure is usually also very divergent and collinear gene sets are not likely to be found.
  • Filter Repetitive Matches: This option adjusts the e-values of the blast hits to lower the significance of sequences that occur multiple times in a genome
  • DAGChainer options:
    • DAGChainer is the algorithm to identify syntenic regions between genomes. It works by searching some distance from a pair of genes for another pair. If some threshold number of gene pairs are identified, DAGChainer keeps that set of gene-pairs which will be reported back to SynMap. These sets of gene pairs are interpreted as two syntenic regions, and get colored in the dotplot for easy identification.
    • "Relative Gene Order" and "Nucleotide Distance".
      • This determines whether DAGChainer is searching by number of genes (or regions of sequence similarity) or by absolute genomic nucleotide distances. For the most part, using "relative gene order" is preferable. The absolute distance between genes in nucleotides varies widely between genomes, and even within a genome. For the former, genes in a bacterial genome are spaced orders of magnitude closer together than genes in an animal genome. For the latter, genes near centromeres are often further from one another than genes distal to a centromere. This is very apparent in plant genomes.
    • Average distance expected between syntenic genes
    • Maximum distance between two matches
      • These two options are self-explanatory. One thing to keep in mind is that the larger these values are set, the more generous DAGChainer is in including genes in a collinear set. In other words, as these values are increased, your false positive rate goes up. Also, the "average distance" should be lower than the "maximum distance".
    • Minimum number of aligned pairs
      • This is the minimum number of gene-pairs that DAGChainer needs in a collinear gene set to keep. The higher this number, the more stringent DAGChainer will be for finding "good" collinear gene sets.
  • Merge Syntenic Blocks: Merges neighboring syntenic blocks into larger blocks. Useful if you are performing statistics on syntenic blocks.
    • Quota Align Merge: Use Haibao Tang's Quota Align method to merge syntenic blocks (recommended). Merge distance is the number of genes or nucleotides to use to link a neighboring syntenic region
    • Iterative DAGChainer: Recode the blocks into a new input file for DAGChainer and run DAGChainer again
  • Syntenic Depth with Quota align:
    • This algorithm removes syntenic regions by enforcing a defined syntenic relationship between genomes. For example, a 1:1 syntenic depth will find and keep the best syntenic regions that cover each genome once. A 1:2 syntenic depth will find the best syntenic regions that cover one genome once and the other genome twice. This option is useful if you want to minimize noise in an analysis with false positive syntenic regions, are testing for multiple polyploidy events, or are trying to mask older polyploidy events from an analysis.
    • Overlap Distance: How much syntenic regions may overlap and still be permitted. Some overlap is recommended as the algorithms for finding syntenic regions may have fuzzy ends that have some overlap with one another.
    • Tutorial on Genomic Rearrangement Analysis using SynMap and GRIMM.
  • FractBias:
    • This algorithm calculates and plots fractionation bias between the selected genomes. In order to run, Syntenic Depth must be set to specify which genome is used as the reference (syntenic depth 1) and which genome will be assessed for fractionation bias (syntenic depth >1).
    • For details on this tool, please see: FractBias
  • CodeML: Turn on this option to calculate synonymous and non-synonymous mutation rates for identified syntenic gene pairs. The syntenic gene pairs get colored according to these values and a histogram with those color-metrics is displayed below the syntenic dotplot.

Advanced Options

  • Tandem duplication distance: This is the distance searched (in genes) when identifying tandem gene duplicates. A value of 10 means that if there are genes within 10 genes of one another that both match the same gene in the other genome, they will be considered tandem duplicates and condensed to a single representative. This option is used by blast_to_raw.py
  • C-score: This option filters low quality blast hits using a cutoff value between 0 and 1. From the documentation of blast_to_raw.py:
see supplementary info for sea anemone genome paper <http://www.sciencemag.org/cgi/content/abstract/317/5834/86>
formula below:
  cscore(A,B) = score(A,B) / max(best score for A, best score for B)
    • If a c-score value of 0.5 is used, this means that any hit to a given gene will be filtered if its score is less than 50% of that gene's best score.
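The c-score formula above can be illustrated with a short Python sketch. This is not the code of blast_to_raw.py; the function name `filter_by_cscore` and the `(gene_a, gene_b, score)` tuple layout are assumptions made for illustration, and scores are assumed to be positive.

```python
from collections import defaultdict


def filter_by_cscore(hits, cutoff):
    """Keep hits whose c-score is at least `cutoff`.

    hits: list of (gene_a, gene_b, score) tuples (hypothetical layout).
    cscore(A,B) = score(A,B) / max(best score for A, best score for B)
    """
    # First pass: record the best score seen for each gene in each genome.
    best_a = defaultdict(float)
    best_b = defaultdict(float)
    for gene_a, gene_b, score in hits:
        best_a[gene_a] = max(best_a[gene_a], score)
        best_b[gene_b] = max(best_b[gene_b], score)

    # Second pass: keep only hits that pass the cutoff.
    kept = []
    for gene_a, gene_b, score in hits:
        cscore = score / max(best_a[gene_a], best_b[gene_b])
        if cscore >= cutoff:
            kept.append((gene_a, gene_b, score))
    return kept


# Made-up scores, for illustration only.
hits = [
    ("geneA1", "geneB1", 200.0),
    ("geneA1", "geneB2", 90.0),   # c-score = 90 / max(200, 150) = 0.45
    ("geneA2", "geneB2", 150.0),
]
kept = filter_by_cscore(hits, 0.5)
print(kept)
```

With a cutoff of 0.5, the second hit is dropped because its score is less than half of the best score recorded for either of its genes.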

Display Options

  • Regenerate dotplot images: Select this if you want the dotplot image regenerated. These images are saved and can be out of date with regard to the methods used to generate them, or something may have gone wrong during their original generation.
  • Show non-syntenic matches (grey dots): Select this option if you wish to see all the putative homologous matches on the dotplot drawn as grey dots. Otherwise, only syntenic matches are shown. This option may increase the time it takes to generate the dotplot as the additional dots are identified and drawn.
  • Draw boxes around syntenic regions? : This option draws boxes around identified syntenic blocks. Using this option is useful when determining whether to merge neighboring syntenic blocks and how much overlap neighboring blocks may have during merging.
  • Sort Chromosomes by: Are chromosomes or contigs on the axes of the syntenic dotplot ordered by size or name.
  • Flip axes: Do you want to exchange the x and y axes?
  • Color diagonals by: By default, syntenic dots (diagonals) are colored green. You can select to color them by:
    • Inversion: positive sloped lines are green, negative sloped lines are blue
    • Diagonals: Each syntenic block gets a different color (actually, cycles through 6 colors). This is a method for visualizing the identified syntenic blocks. Try this and the "Draw boxes around syntenic regions".
  • Dotplot axis metric: Are the dotplot axes to be measured in nucleotide or gene units? Nucleotides accurately reflect the structure of the genome. Gene units help make collinear lines straighter if there are regions of the genome undergoing different rates of expansion (e.g. centromeres).
  • Master image width: How wide, in pixels, will the master dotplot image be? Dynamic scales it according to the size of the genome, but sometimes it is necessary to make them very large to see some details.
  • Calculate substitution rates for syntenic protein coding gene pairs and color syntenic dots accordingly
    • This is the option to set if you want to calculate synonymous, nonsynonymous, or the ratio of those and color syntenic gene pair dots based on those values. This is very helpful if there are whole genome duplication events in one or both genomes. When these values are generated, a histogram of the values will also be generated and displayed. The histogram will be color coded identically to the dots in the dotplot. This makes it easy to determine which syntenic gene pair dots are older or newer than other syntenic dots. For an example, please see below. PLEASE NOTE: generating these values takes a long time. It takes a very long time for large genomes.
  • Order contigs by best syntenic path: this option will invoke a special algorithm to rearrange contigs in order to create the best order of contigs that make a continuous syntenic path in relation to a reference genome. This is very useful for genome sequences that have been assembled to contigs for which a reference genome is available. PLEASE NOTE: this ordering of contigs may not be how the genome is actually structured. Some genomic changes, such as large-scale inversions, will not be correctly placed if the contigs' break-points are at the end of the inversion. This is often the case as inversions happen in repetitive sequence, and these same repetitive sequences are often where genome assembly algorithms cannot assemble through. For an example, see Syntenic path assembly.

Calculating and displaying synonymous/non-synonymous (Ks, Kn) data

This option is selected under the "SynMap Options" tab. To select, just set the option "Calculate substitution rates for syntenic protein coding gene pairs and color syntenic dots accordingly" to the rate you desire:

  • Ks -- synonymous mutation rate
  • Kn -- non-synonymous mutation rate (also known as Ka)
  • Kn/Ks -- ratio of non-synonymous to synonymous changes. This is often used to detect neutral (Kn/Ks == 1), positive (Kn/Ks > 1), and purifying (Kn/Ks < 1) selection acting on a pair of genes.
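The Kn/Ks interpretation listed above can be written as a small Python helper. This is only an illustrative sketch: the function name `classify_selection` and its `tolerance` parameter are assumptions, not part of SynMap, and real analyses normally apply a statistical test rather than a bare threshold.

```python
def classify_selection(kn, ks, tolerance=0.0):
    """Classify a gene pair by its Kn/Ks ratio.

    Returns "neutral", "positive", or "purifying", or None when Ks is 0
    and the ratio is undefined. `tolerance` (hypothetical) widens the band
    around 1 that is treated as neutral.
    """
    if ks == 0:
        return None
    ratio = kn / ks
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "positive" if ratio > 1.0 else "purifying"


# Kn = 0.0258 and Ks = 0.1334 give a ratio of about 0.19.
label = classify_selection(0.0258, 0.1334)
print(label)
```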

Synonymous (Ks) and non-synonymous (Kn) site changes are calculated by:

  1. Performing a global sequence alignment of the protein sequences using the Needleman-Wunsch algorithm implemented in nwalign (written and maintained by Brent Pedersen) with the BLOSUM62 scoring matrix.
  2. Back translating the protein sequence into aligned codons
  3. Using codeml of the PAML software package written and maintained by Ziheng Yang. We modified codeml for implementation in SynMap in order to minimize the number of read/write cycles to the hard drives, as well as allow it to be more easily run in parallel on multi-core servers. For each pairwise comparison of aligned codon sequences, codeml is run 5 times using its default parameter sets, and the lowest Ks is kept.
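Step 2 above, back translating an aligned protein sequence into aligned codons, can be sketched in Python as follows. This is a minimal illustration, not SynMap's implementation: the function name `back_translate` is hypothetical, gaps are assumed to be written as "-", and the coding sequence is assumed to start in frame at the first aligned residue.

```python
def back_translate(aligned_protein, cds):
    """Thread an ungapped coding sequence onto a gapped protein alignment row.

    Each residue consumes the next three nucleotides of `cds`; each gap
    character "-" in the protein row becomes a "---" codon gap.
    """
    residues = sum(1 for char in aligned_protein if char != "-")
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is too short for the protein row")

    codons = []
    position = 0
    for char in aligned_protein:
        if char == "-":
            codons.append("---")
        else:
            codons.append(cds[position:position + 3])
            position += 3
    return "".join(codons)


# Toy example: protein M-KV aligned against a 9-nucleotide coding sequence.
codon_alignment = back_translate("M-KV", "ATGAAAGTT")
print(codon_alignment)
```

Applying the same function to both rows of a pairwise protein alignment yields two codon-aligned sequences of equal length, which is the kind of input a codon-based rate calculation expects.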

Running SynMap

Just press "Generate SynMap" when everything is configured. While SynMap runs, messages will be displayed as to the stage of the analysis that is being processed.

Interacting with results

Main results

  1. Cross-hairs help visually align syntenic regions from different parts of the genome.
  2. Clicking on a chromosome panel in the initial results creates a zoomed-in panel.

Zoomed-in results

  1. Cross-hairs turn red when the mouse is over a gene pair that is a link to GEvo with that pair pre-loaded. GEvo can then be used to perform a high-resolution analysis of those genomic regions.

Output Files

SynMap provides links to all the output files generated during its analysis. Most of the files are located in a box below the dotplot named "Links and Downloads".

Log File

  • Analysis Log (id: SynMap_52205333):
    • This file contains a list of every computational action taken throughout SynMap's analytical pipeline. This includes the exact command line used to run each program. This file is very useful for reproducing an analysis yourself, and figuring out errors in the pipeline should they occur.
#################### 
Runing genome comparison
Running (B)lastZ
running:
	/usr/bin/python /opt/apache/CoGe/bin/blastz_wrapper/blastz.py -A 32 --path=/usr/bin/lastz -i /opt/apache/CoGe/data/fasta//7118-CDS.fasta -d /opt/apache/CoGe/data/fasta//4243-CDS.fasta -o /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz
blastfile /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz already exists
Completed blast run(s)
####################

####################
Creating .bed files
Creating bed files: /opt/apache/CoGe/bin/SynMap/blast2bed.pl -infile /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz -outfile1 /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz.q.bed -outfile2 /opt/apache/CoGe/diags//Escherichia_coli_K12_strain_K-12_substrain_NCM4781/Escherichia_coli_K12_strain_K-12_substrain_W3110/7118_4243.CDS-CDS.lastz.s.bed
####################

Homolog search

These are the data files that are used and processed to identify putative homologous genes between the selected genomes

  • Fasta file for Escherichia coli K12 strain K-12 substrain DH10B (v1): CDS
  • Fasta file for Escherichia coli K12 strain K-12 substrain MG1655 (v2): CDS
    • These are the fasta files used in the whole genome comparison. May be CDS or genomic sequences.
  • Unfiltered (B)lastZ results
    • This is the output from the whole genome comparison. The output will depend on the sequence comparison algorithm selected. This output is in tab-delimited format and is the raw blast output. The headers of the columns for the output file are: query ID(chr|start|stop|name|strand|feat type|database id|gene order)<tab>subject id (chr|start|stop|name|strand|feat type|database id|gene order)<tab>Percent identity<tab>Alignment Length<tab>Mismatch count<tab>gap open count<tab>query start<tab>query end<tab>Subject start<tab>Subject end<tab>eValue<tab>Bit Score.
  • Filtered (B)lastZ results (no tandem duplicates)
  • Tandem Duplicates for Escherichia coli K12 strain K-12 substrain DH10B (v1)
  • Tandem Duplicates for Escherichia coli K12 strain K-12 substrain MG1655 (v2)
    • Tandem duplicates are removed from the whole genome comparison. The filtered blast file is sent to DAGChainer, and the tandem duplicates are saved for each genome. Each line of the tandem duplicate file contains one set of tandem duplicates and links to CoGe to extract those sequences. The format for the tandem duplicate files is as follows:
#FeatList_link  GEvo_link       FastaView_link  chr||start||stop||name||strand||type||database_id||gene_order
http://genomevolution.org/CoGe/FeatList.pl?fid=41633483;fid=41633480;   http://genomevolution.org/CoGe/GEvo.pl?fid1=41633483;fid2=41633480;num_seqs=2   http://genomevolution.org/CoGe/FastaView.pl?fid=41633483;fid=41633480;  1||1009879||1010469||E4781_964||-1||CDS||41633480||974  1||1010569||1010712||E4781_965||-1||CDS||41633483||975
    • FeatList_link: Generate a feature list of the tandem duplicates (this tool links to many other tools in CoGe for analyzing and processing lists of genomic features)
    • GEvo_link: Compare the tandem duplicates and their genomic region
    • FastaView_link: Generate fasta sequences of the tandem duplicates
    • Each tandem duplicate then follows these links and contains the following information delimited by "||"
      • Chromosome
      • Start position
      • Stop position
      • Gene name
      • Strand
      • Genomic feature type (CDS, tRNA, rRNA)
      • CoGe database id (useful for constructing links to CoGe's various tools)
      • Order within the genome
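A line of the tandem duplicate file can be split into its parts with a few lines of Python. The sketch below is not part of CoGe: the names `Feature`, `parse_feature` and `parse_tandem_line` are hypothetical, and it assumes columns are separated by whitespace and that no field contains whitespace.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chromosome: str
    start: int
    stop: int
    name: str
    strand: int
    feature_type: str
    database_id: int
    gene_order: int


def parse_feature(field):
    """Parse chr||start||stop||name||strand||type||database_id||gene_order."""
    chrom, start, stop, name, strand, ftype, db_id, order = field.split("||")
    return Feature(chrom, int(start), int(stop), name, int(strand),
                   ftype, int(db_id), int(order))


def parse_tandem_line(line):
    """Split one tandem duplicate line into its three links and its features."""
    featlist_link, gevo_link, fastaview_link, *fields = line.split()
    return {
        "featlist_link": featlist_link,
        "gevo_link": gevo_link,
        "fastaview_link": fastaview_link,
        "features": [parse_feature(field) for field in fields],
    }


# The example line shown above, rebuilt with tab separators.
line = "\t".join([
    "http://genomevolution.org/CoGe/FeatList.pl?fid=41633483;fid=41633480;",
    "http://genomevolution.org/CoGe/GEvo.pl?fid1=41633483;fid2=41633480;num_seqs=2",
    "http://genomevolution.org/CoGe/FastaView.pl?fid=41633483;fid=41633480;",
    "1||1009879||1010469||E4781_964||-1||CDS||41633480||974",
    "1||1010569||1010712||E4781_965||-1||CDS||41633483||975",
])
record = parse_tandem_line(line)
print([feature.name for feature in record["features"]])
```

Header lines that start with "#" would need to be skipped before calling `parse_tandem_line`.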

Diagonals

These are the files that will be sent into DAGChainer to identify syntenic regions and gene pairs. Different files will be present depending on the options used in the analysis:

  • DAGChainer Initial Input file
    • The whole genome comparison file (post tandem gene removal) converted into the input type for DAGChainer
  • DAGChainer Input file converted to gene order
    • If relative gene order is used to specify gene positions within a genome, this file will be generated. The chromosomal start/stop positions will be replaced with the gene position along a chromosome. For example if the position is 1123, that feature is the 1123rd feature along that chromosome.
  • DAGChainer Input file post repetitive matches filter
    • This file has the e-values adjusted such that genomic features which match multiple regions in the genome get higher (less significant) e-values. This is the "Filter Repetitive Matches" option in the "Analysis Options" tab.
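The conversion from nucleotide positions to relative gene order described above can be sketched in Python. This is an illustration under stated assumptions, not SynMap's code: the function name `assign_gene_order` and the `(chromosome, start, name)` input layout are hypothetical, feature names are assumed to be unique, and features are ranked by start position within each chromosome. How SynMap itself breaks ties or counts features is not specified here.

```python
from collections import defaultdict


def assign_gene_order(features):
    """Return {name: order}, where order is the 1-based rank of the feature
    along its chromosome when features are sorted by start position.

    features: iterable of (chromosome, start, name) tuples (hypothetical layout).
    """
    by_chromosome = defaultdict(list)
    for chromosome, start, name in features:
        by_chromosome[chromosome].append((start, name))

    order = {}
    for genes in by_chromosome.values():
        for rank, (_start, name) in enumerate(sorted(genes), start=1):
            order[name] = rank
    return order


# Made-up features for illustration only.
features = [
    ("1", 5000, "geneC"),
    ("1", 100, "geneA"),
    ("1", 2500, "geneB"),
    ("2", 300, "geneD"),
]
gene_order = assign_gene_order(features)
print(gene_order)
```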

Example DAGChainer input line:

a7118_1 1||2588568||2593148||E4781_2508||-1||CDS||41638112||2528||100.0 2528    2528    b4243_1 1||2776802||2781382||AP_003226.1||-1||CDS||24957549||2612||100.0        2612    2612    0
  • <org a chromosome>: This is coded with an "a" or "b", followed by the CoGe database id for the genome, followed by an "_", followed by the chromosome name. This creates a unique "name" for each chromosome in DAGChainer even if the same genome is compared against itself
  • <org a gene name>: This is a "||" delimited format that contains the following information:
    • Chromosome
    • Start position
    • Stop position
    • Gene name
    • Strand
    • Genomic feature type (CDS, tRNA, rRNA)
    • CoGe database id (useful for constructing links to CoGe's various tools)
    • Order within the genome
    • Blast hit identity of pair
  • <org a gene start position>
  • <org a gene stop position>: Note: will be same as start if relative gene order on chromosome is used (as shown in the example above)
  • <org b chromosome>
  • <org b gene name>
  • <org b gene start position>
  • <org b gene stop position>
  • <blast e-value>

Results

  • DAGChainer output
    • Output generated by DAGChainer.
#1      133556.0        a4241_1 b7117_1 f       2673
a4241_1 1||1397989||1398285||ECDH10B_1366||-1||CDS||24909233||1293||99.7        1293    1293    b7117_1 1||1290089||1290385||E4401_1219||-1||CDS||41621513||1231||99.7  1231    1231    1.000000e-250   50
a4241_1 1||1398509||1399228||ECDH10B_1367||1||CDS||24909270||1294||100.0        1294    1294    b7117_1 1||1290594||1291328||E4401_1220||1||CDS||41621516||1232||100.0  1232    1232    1.000000e-250   100
    • DAGChainer output files contain two types of data lines:
      • "#": lines separate identified syntenic blocks. The information contained in this line is:
        • Syntenic block number (NOTE: this is not a unique number: each chromosome-chromosome (contig-contig) comparison has this number reset)
        • Score for syntenic block
        • Organism A chromosome in syntenic block
        • Organism B chromosome in syntenic block
        • Orientation of syntenic block: f == same orientation, positive slope on dotplot; r == opposite orientation, negative slope on dotplot
        • Number of syntenic gene pairs identified in syntenic block
      • All other lines: contain a syntenic match
        • Org A Chromosome
        • Gene/region in org A
        • Start position in org A (may be relative gene order or absolute nucleotide position)
        • Stop position in org A (may be relative gene order or absolute nucleotide position)
        • Org B Chromosome
        • Gene/region in org B
        • Start position in org B (may be relative gene order or absolute nucleotide position)
        • Stop position in org B (may be relative gene order or absolute nucleotide position)
        • Evalue of match
        • Cumulative score of diagonal including this gene match (number increases with additional syntenic gene pairs within a syntenic block)
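
The two line types described above can be read into per-block records with a short Python sketch. It is illustrative only: `read_blocks` and the dictionary keys are hypothetical names, and the sketch assumes whitespace-separated columns and that every "#" line is a block header in the six-field form shown in the sample.

```python
def read_blocks(lines):
    """Group DAGChainer output lines into syntenic blocks.

    Each block is a dict with the header fields plus a "pairs" list
    holding one dict per syntenic match line.
    """
    blocks = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header: block number, score, chr A, chr B, orientation, pair count
            cols = line[1:].split()
            current = {
                "number": int(cols[0]),
                "score": float(cols[1]),
                "a_chr": cols[2],
                "b_chr": cols[3],
                "orientation": cols[4],  # "f" or "r"
                "n_pairs": int(cols[5]),
                "pairs": [],
            }
            blocks.append(current)
            continue
        if current is None:
            raise ValueError("match line found before any block header")
        cols = line.split()
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        current["pairs"].append({
            "a_chr": cols[0],
            "a_name": cols[1].split("||")[3],  # fourth '||' part is the gene name
            "a_start": int(cols[2]),
            "a_stop": int(cols[3]),
            "b_chr": cols[4],
            "b_name": cols[5].split("||")[3],
            "b_start": int(cols[6]),
            "b_stop": int(cols[7]),
            "evalue": float(cols[8]),
            "score": float(cols[9]),  # cumulative score along the diagonal
        })
    return blocks


SAMPLE = """\
#1 133556.0 a4241_1 b7117_1 f 2673
a4241_1 1||1397989||1398285||ECDH10B_1366||-1||CDS||24909233||1293||99.7 1293 1293 b7117_1 1||1290089||1290385||E4401_1219||-1||CDS||41621513||1231||99.7 1231 1231 1.000000e-250 50
a4241_1 1||1398509||1399228||ECDH10B_1367||1||CDS||24909270||1294||100.0 1294 1294 b7117_1 1||1290594||1291328||E4401_1220||1||CDS||41621516||1232||100.0 1232 1232 1.000000e-250 100
"""
blocks = read_blocks(SAMPLE.splitlines())
for block in blocks:
    print(block["number"], block["orientation"], len(block["pairs"]))
```

Because the block number is reset for each chromosome-chromosome comparison, a unique key for a block needs the two chromosome labels as well as the number, for example `(block["a_chr"], block["b_chr"], block["number"])`.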
  • Merged DAGChainer output
    • This file is generated if the analysis option is selected for merging neighboring syntenic regions
  • Quota Alignment output
    • This file is generated if the analysis option for Quota Align is selected for enforcing a specific syntenic depth relationship between the analyzed genomes
  • DAGChainer output in genomic coordinates
    • This file is the same as the DAGChainer output file but the genomic coordinates have been converted from gene order to nucleotide positions.
  • Results with synonymous/nonsynonymous rate values
    • This file is generated if synonymous or non-synonymous rates were calculated for syntenic gene pairs. It is identical to the DAGChainer output file but also contains the synonymous (Ks) and non-synonymous (Kn) rates in the first two positions of each syntenic gene pair line. Each line ends with a link to GEvo for comparing the genomic regions in which the syntenic gene pair resides.
#This file contains synonymous rate values in the first two columns:
#1      133556.0        a4241_1 b7117_1 f       2673  Mean Ks:  0.1334  Mean Kn: 0.0258
#Ks     Kn      a<db_dataset_group_id>_<chr>    chr1||start1||stop1||name1||strand1||type1||db_feature_id1||percent_id1 start1  stop1 b<db_dataset_group_id>_<chr>    chr2||start2||stop2||name2||strand2||type2||db_feature_id2||percent_id2 start2  stop2   eval    ???    GEVO_link
0.0093  0.0000  a4241_1 1||1397989||1398285||ECDH10B_1366||-1||CDS||24909233||1293||99.7        1398285 1397989 b7117_1 1||1290089||1290385||E4401_1219||-1||CDS||41621513||1231||99.7  1290385 1290089 1.000000e-250   50 http://genomevolution.org/CoGe/GEvo.pl?fid1=24909233;fid2=41621513;dsgid1=4241;dsgid2=7117
0.0000  0.0000  a4241_1 1||1398509||1399228||ECDH10B_1367||1||CDS||24909270||1294||100.0        1398509 1399228 b7117_1 1||1290594||1291328||E4401_1220||1||CDS||41621516||1232||100.0  1290594 1291328 1.000000e-250   100 http://genomevolution.org/CoGe/GEvo.pl?fid1=24909270;fid2=41621516;dsgid1=4241;dsgid2=7117
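
As a rough illustration of how the Ks/Kn columns in the sample lines above might be read, the following Python sketch extracts the rates, gene names and GEvo link from each gene pair line. The names `KsPair`, `read_ks_pairs` and `mean_ks` are hypothetical. The sketch assumes whitespace-separated columns, skips every line starting with "#", and treats any rate that does not parse as a number as missing; whether real files contain such values is not stated here.

```python
from typing import NamedTuple, Optional


class KsPair(NamedTuple):
    ks: Optional[float]
    kn: Optional[float]
    a_name: str
    b_name: str
    gevo_link: Optional[str]


def _to_float(text: str) -> Optional[float]:
    """Return a float, or None if the value is not numeric (assumption)."""
    try:
        return float(text)
    except ValueError:
        return None


def read_ks_pairs(lines):
    """Yield one KsPair per syntenic gene pair line."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and syntenic block headers
        cols = line.split()
        if len(cols) < 12:
            raise ValueError(f"expected at least 12 columns, got {len(cols)}")
        # Columns: Ks, Kn, then the ten DAGChainer columns, then the link.
        yield KsPair(
            ks=_to_float(cols[0]),
            kn=_to_float(cols[1]),
            a_name=cols[3].split("||")[3],
            b_name=cols[7].split("||")[3],
            gevo_link=cols[12] if len(cols) > 12 else None,
        )


def mean_ks(pairs):
    """Average Ks over the pairs that have a numeric value."""
    values = [p.ks for p in pairs if p.ks is not None]
    return sum(values) / len(values) if values else None


KS_SAMPLE = """\
#This file contains synonymous rate values in the first two columns:
#1 133556.0 a4241_1 b7117_1 f 2673 Mean Ks: 0.1334 Mean Kn: 0.0258
0.0093 0.0000 a4241_1 1||1397989||1398285||ECDH10B_1366||-1||CDS||24909233||1293||99.7 1398285 1397989 b7117_1 1||1290089||1290385||E4401_1219||-1||CDS||41621513||1231||99.7 1290385 1290089 1.000000e-250 50 http://genomevolution.org/CoGe/GEvo.pl?fid1=24909233;fid2=41621513;dsgid1=4241;dsgid2=7117
0.0000 0.0000 a4241_1 1||1398509||1399228||ECDH10B_1367||1||CDS||24909270||1294||100.0 1398509 1399228 b7117_1 1||1290594||1291328||E4401_1220||1||CDS||41621516||1232||100.0 1290594 1291328 1.000000e-250 100 http://genomevolution.org/CoGe/GEvo.pl?fid1=24909270;fid2=41621516;dsgid1=4241;dsgid2=7117
"""
ks_pairs = list(read_ks_pairs(KS_SAMPLE.splitlines()))
for pair in ks_pairs:
    print(pair.a_name, pair.b_name, pair.ks, pair.kn)
```

The mean computed from these two lines will not match the "Mean Ks" value in the block header, since the header summarizes all 2673 gene pairs in the block.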
  • Final syntenic gene-set output with GEvo links
    • Same as the output file, but each line also ends with a link to GEvo for the pair of syntenic genes. This link loads GEvo with the gene pair entered into two of the sequence submission boxes so that the genomic regions in which they are contained may be examined.
  • Condensed syntelog file with GEvo links
    • This file assembles sets of syntenic genes so that if one gene in organism A is syntenic to multiple regions in organism B, those are listed together. This is useful if there are duplications in one genome (e.g. segmental or whole genome). Each line contains:
      • Link to GEvo to analyse the syntenic gene set
      • Link to GEvo formatted for a web-browser. When this link is clicked, GEvo will automatically start running the analysis
      • Each gene in the syntenic gene set separated by tabs
http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857462;dsgid1=4241;fid2=41617910;dsgid2=7117;fid3=41622794;dsgid3=7117;num_seqs=3    <a href=http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857462;dsgid1=4241;fid2=41617910;dsgid2=7117;fid3=41622794;dsgid3=7117;num_seqs=3;autogo=1>AutoGo</a>        ECDH10B_0042    E4401_18        E4401_1646
http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857502;dsgid1=4241;fid2=41617913;dsgid2=7117;fid3=41622797;dsgid3=7117;num_seqs=3    <a href=http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857502;dsgid1=4241;fid2=41617913;dsgid2=7117;fid3=41622797;dsgid3=7117;num_seqs=3;autogo=1>AutoGo</a>        ECDH10B_0043    E4401_19        E4401_1647
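
A line of the condensed syntelog file can be taken apart with a sketch like the one below. It is a hedged example, not CoGe code: `parse_condensed_line` is a hypothetical name, and the sketch assumes the columns are separated by tab characters as described above. Splitting on any whitespace would not work here, because the AutoGo column contains a space inside its `<a href=...>` markup.

```python
import re


def parse_condensed_line(line: str) -> dict:
    """Split one condensed syntelog line into link, AutoGo URL and genes."""
    cols = [c for c in line.rstrip("\n").split("\t") if c]
    if len(cols) < 3:
        raise ValueError("expected a link, an AutoGo anchor and at least one gene")
    link, anchor, genes = cols[0], cols[1], cols[2:]

    # Pull the URL out of the unquoted href attribute of the AutoGo anchor.
    match = re.search(r"href=([^>\s]+)>", anchor)
    autogo = match.group(1) if match else None

    # GEvo separates query parameters with ";" in these links.
    query = link.split("?", 1)[1] if "?" in link else ""
    params = dict(p.split("=", 1) for p in query.split(";") if "=" in p)

    return {"link": link, "autogo": autogo, "params": params, "genes": genes}


CONDENSED_EXAMPLE = (
    "http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857462;"
    "dsgid1=4241;fid2=41617910;dsgid2=7117;fid3=41622794;dsgid3=7117;num_seqs=3"
    "\t"
    "<a href=http://genomevolution.org/CoGe/GEvo.pl?pad_gs=10000;fid1=24857462;"
    "dsgid1=4241;fid2=41617910;dsgid2=7117;fid3=41622794;dsgid3=7117;num_seqs=3;"
    "autogo=1>AutoGo</a>"
    "\tECDH10B_0042\tE4401_18\tE4401_1646"
)
syntelog_set = parse_condensed_line(CONDENSED_EXAMPLE)
print(syntelog_set["genes"], syntelog_set["params"]["num_seqs"])
```

In the example line the `num_seqs` parameter of the link equals the number of gene names listed, which gives a simple consistency check when reading the file.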


Regenerating an Analysis

After a SynMap analysis completes, SynMap returns a link that reloads SynMap with the exact configuration of your analysis. You can save this link in your notes, send it to a collaborator, or publish it. The link is under the "Links and Downloads" section of the results, on a green button labeled "Regenerate this analysis: http://genomevolution.org/r/2o29". To copy the link, right-click the button and choose "copy link location". If you click the button instead, the shortened link is sent to a URL detransmogrifier and expanded into the full-length link.

Example Results

Syntenic comparison of Arabidopsis thaliana and Arabidopsis lyrata

Maize and Sorghum

Strains of Escherichia coli K12

X-alignments of various bacterial genomes

Human-Chimp Whole Genome Comparison (Video)

FAQ

How do I sort the chromosomes by name instead of by size?

  1. Go to the "Display Options" tab
  2. Select "Name" from the menu for "Sort Chromosomes by:"