In this example, CoGe will be used to perform whole genome syntenic analysis of the genomes of maize and sorghum in order to:
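When comparing the genomes of maize and sorghum, three genomic evolutionary events need to be considered. Figure 1 shows these events, which are, in chronological order: (1) the pre-grass whole genome duplication event, shared by both lineages; (2) the divergence of the maize and sorghum lineages; and (3) the maize-specific whole genome duplication event.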
Each one of these events creates a copy of the genome, and these events can be seen in a syntenic dotplot between these genomes.
A whole genome syntenic dotplot takes two genomes and lays them out end-to-end, one along each axis. In Figure 2, the sorghum genome is on the x-axis and the maize genome is on the y-axis. Each black vertical and horizontal line delineates a chromosome. Each gene from one genome is compared to each gene from the other, and a dot is drawn at the corresponding x-y coordinate if the two genes are similar in sequence. Genes with similar DNA sequence are putative homologs. These results are then fed into an algorithm that finds collinear series of genes. If two genomic regions are related through common descent from the same ancestral genomic region, they will maintain a collinear arrangement of genes inherited from that ancestor. Although genomes change, genes move to new genomic positions, and duplicate genes are lost, this pattern of collinear gene arrangement remains discernible over long evolutionary time periods and can be used to infer that two genomic regions are related through common ancestry (synteny). When such collinear arrangements are detected in the syntenic dotplot, the corresponding dots are colored. Pairs of genes in a collinear arrangement are called syntenic gene pairs, or syntelogs.
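The idea of finding collinear series of homologous genes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm CoGe itself runs: genes are represented by their order index along each genome, the function name `find_collinear_chains` and the parameters `max_gap` and `min_size` are hypothetical, and only chains running in the same direction in both genomes are detected (inverted regions are ignored).

```python
def find_collinear_chains(pairs, max_gap=3, min_size=3):
    """Group homologous gene pairs into collinear chains.

    pairs    -- iterable of (x, y) gene-order positions, one per putative
                homolog pair (x in genome 1, y in genome 2)
    max_gap  -- largest step allowed between neighbouring chain members
                (hypothetical parameter)
    min_size -- smallest number of pairs reported as a chain
                (hypothetical parameter)
    """
    points = sorted(set(pairs))
    n = len(points)
    score = [1] * n      # length of the best chain ending at each point
    prev = [None] * n    # predecessor of each point in that chain

    # Dynamic programming: extend the best chain that ends close by,
    # upstream in both genomes.
    for i in range(n):
        xi, yi = points[i]
        for j in range(i):
            xj, yj = points[j]
            close = 0 < xi - xj <= max_gap and 0 < yi - yj <= max_gap
            if close and score[j] + 1 > score[i]:
                score[i] = score[j] + 1
                prev[i] = j

    # Trace chains back from the highest-scoring end points.
    used = set()
    chains = []
    for i in sorted(range(n), key=lambda k: -score[k]):
        chain = []
        k = i
        while k is not None and k not in used:
            chain.append(points[k])
            used.add(k)
            k = prev[k]
        if len(chain) >= min_size:
            chains.append(chain[::-1])
    return chains


# Invented toy data: four collinear pairs plus two isolated hits.
toy_pairs = [(1, 1), (2, 2), (3, 3), (4, 5), (10, 2), (20, 40)]
for chain in find_collinear_chains(toy_pairs):
    print(chain)
```

In this toy input the four pairs near the diagonal form one chain, while the two isolated hits are left out, which mirrors how only dots in collinear arrangements get colored in the dotplot.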
Since the whole genome duplication and lineage divergence events happened at different times in the evolutionary history of the maize and sorghum lineages, the gene pairs derived from those events are also of different ages. One way to measure the relative age of a pair of related genes is to estimate their rate of synonymous mutations. Genes that are more closely related usually have fewer synonymous changes than genes that are more distantly related. The rate of synonymous change has been measured for each pair of syntelogs identified in the maize-sorghum syntenic dotplot, and the dots are colored such that younger syntelogs (lower number of synonymous changes) are red and older syntelogs (higher number of synonymous changes) are purple. Looking at the syntenic dotplot, it is now easy to identify red, younger syntenic regions and purple, older syntenic regions.
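The separation of syntelogs into a younger and an older class can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the function name `classify_by_ks`, the gene identifiers, the synonymous-change values, and the cutoff are all invented, and a real analysis would choose the cutoff from the observed distribution of values rather than hard-code it.

```python
def classify_by_ks(ks_by_pair, cutoff):
    """Split syntelog pairs into younger and older classes.

    ks_by_pair -- dict mapping a (gene_a, gene_b) pair to its estimated
                  synonymous change value
    cutoff     -- value separating the two classes (hypothetical; pick it
                  from your own data)
    """
    younger, older = [], []
    for pair, ks in ks_by_pair.items():
        if ks < cutoff:
            younger.append(pair)   # would be drawn red in the dotplot
        else:
            older.append(pair)     # would be drawn purple in the dotplot
    return sorted(younger), sorted(older)


# Invented identifiers and values, for illustration only.
example = {
    ("sb_gene_1", "zm_gene_1a"): 0.15,
    ("sb_gene_1", "zm_gene_1b"): 0.18,
    ("sb_gene_1", "zm_gene_9a"): 0.95,
}
younger, older = classify_by_ks(example, cutoff=0.5)
print("younger:", younger)
print("older:", older)
```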
Looking closely at the syntenic dotplot, there is an overlap of these colored lines when the lines are projected to one axis or the other. This is because a given region of one genome is syntenic to multiple regions in the other genome. Based on the series of events listed above, it is expected that for every region of the sorghum genome, there will be two red lines in maize because maize has had a whole genome duplication event after these lineages diverged. On the other hand, for each region of the maize genome, there will only be one red line in sorghum.
Understanding the purple lines is a bit more complicated. These syntenic regions are derived from the older shared whole genome duplication event. As seen with the red lines, for a given region of sorghum, there are two purple lines that come from maize's most recent whole genome duplication, and for a given region of maize, there will be a single purple line in sorghum.
Altogether, this means that there is a 2:4 syntenic relationship between sorghum and maize. There are two regions in sorghum from the pre-grass whole genome duplication event, and four in maize from the pre-grass whole genome duplication event combined with the subsequent maize-specific whole genome duplication event. This means that for any genomic region in maize or sorghum, there are a total of 5 other syntenic regions. This gives rise to the possibility of comparing 6 syntenic regions at once: 2 from sorghum and 4 from maize.
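The arithmetic behind the 2:4 relationship can be written out explicitly. This is a small sketch that assumes each whole genome duplication doubles the number of copies of a region and that no copy has been lost entirely; the function name `copies_after` is made up for the example.

```python
def copies_after(num_duplications):
    """Copies of an ancestral region after a number of whole genome
    duplications, assuming every duplication doubles the count and no
    copy is lost entirely."""
    return 2 ** num_duplications


sorghum_regions = copies_after(1)   # pre-grass duplication only
maize_regions = copies_after(2)     # pre-grass plus maize-specific duplication

total_regions = sorghum_regions + maize_regions
other_regions = total_regions - 1   # partners of any one chosen region

print(f"sorghum:maize = {sorghum_regions}:{maize_regions}")
print(f"regions comparable at once: {total_regions}")
print(f"other syntenic regions for any one region: {other_regions}")
```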
Another way to see these patterns is through high-resolution analysis of syntenic regions using GEvo. If SynMap is used to create and visualize syntenic dotplots, the results are interactive and provide links to GEvo. Figure 3 shows an example 6-way comparison of syntenic regions from maize and sorghum dating back to the pre-grass whole genome duplication event. Each panel of the figure represents one genomic region. In this figure, the two sorghum regions derived from the pre-grass whole genome duplication event are the middle two panels, with two maize syntenic regions located above or below each sorghum region. Each pair of maize regions is derived from the maize-specific whole genome duplication event and is orthologous to the closest sorghum region (derived from the divergence of their lineages). The two sorghum regions are paralogous (or homeologous) to each other (derived from the pre-grass whole genome duplication event).
In Figure 3, pairwise comparisons of these regions have been performed in order to identify similar protein coding DNA sequences. For several comparisons, colored lines have been drawn connecting regions of sequence similarity. These lines have a collinear arrangement, which is evidence that the regions are syntenic. However, notice that the density of lines differs between comparisons. Each sorghum region has lines drawn connecting it to its two orthologous maize regions and to the other sorghum region. When comparing the pair of sorghum regions, not all of the genes are shared. This is due to a process known as fractionation: following a whole genome duplication event, many duplicated genes are lost from one homeologous region or its partner region over evolutionary time.
Fractionation is also seen between pairs of maize regions derived from the maize-specific whole genome duplication event. Figure 4 shows a high-resolution analysis using GEvo of a sorghum region compared to its two syntenic orthologous regions from maize. While a given sorghum region has nearly its entire gene content represented across its two orthologous maize regions, some genes are represented in only one of the two regions. This is due to genes duplicated in the maize-specific whole genome duplication event being lost from one homeologous region or the other.
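The fractionation pattern described here can be tallied with a simple classification. The sketch below assumes that, for each gene in a sorghum region, it is already known whether a syntelog was found in each of the two orthologous maize regions; the function name `classify_retention` and all gene identifiers are invented for illustration.

```python
from collections import Counter


def classify_retention(sorghum_genes, maize_region1_hits, maize_region2_hits):
    """Label each sorghum gene by where its maize syntelogs were retained.

    sorghum_genes      -- ordered list of gene names in the sorghum region
    maize_region1_hits -- set of sorghum genes with a syntelog in maize region 1
    maize_region2_hits -- set of sorghum genes with a syntelog in maize region 2
    """
    labels = {}
    for gene in sorghum_genes:
        in_first = gene in maize_region1_hits
        in_second = gene in maize_region2_hits
        if in_first and in_second:
            labels[gene] = "both"
        elif in_first:
            labels[gene] = "region1_only"   # lost from region 2
        elif in_second:
            labels[gene] = "region2_only"   # lost from region 1
        else:
            labels[gene] = "neither"
    return labels


# Invented gene names, for illustration only.
genes = ["sb_a", "sb_b", "sb_c", "sb_d"]
labels = classify_retention(genes, {"sb_a", "sb_b"}, {"sb_a", "sb_c"})
print(labels)
print(Counter(labels.values()))
```

Genes labeled `region1_only` or `region2_only` are the fractionated ones: one of the two copies created by the maize-specific duplication is no longer detected.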
Using the previously generated results
Once syntenic gene sets are identified, they can be analyzed for conserved non-coding sequences (CNSs). CNSs are DNA sequences, usually neighboring genes, that do not code for protein but are nonetheless conserved over evolutionary time, much like protein coding sequences. Because of their sequence conservation, CNSs are thought to have a biological function for which selection is preserving the DNA sequence. An example of such a function is transcription factor binding.
Figure 5 shows a high-resolution analysis using GEvo to identify CNSs. In this figure, there are several regions of sequence similarity that are collinear with respect to neighboring genes and do not overlap protein coding regions (shown as green arrows in these panels). Interestingly, some CNSs are retained in both regions of maize.
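The core filtering step, keeping regions of sequence similarity that do not overlap protein coding regions, can be sketched as an interval check. This is an illustration under stated assumptions rather than the procedure GEvo uses: intervals are half-open `(start, end)` coordinates on a single region, the function names are hypothetical, and the coordinates are invented.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]


def noncoding_hits(similarity_hits, coding_regions):
    """Keep the similarity hits that overlap no protein coding region."""
    return [
        hit for hit in similarity_hits
        if not any(overlaps(hit, cds) for cds in coding_regions)
    ]


# Invented coordinates, for illustration only.
hits = [(100, 150), (400, 460), (900, 950)]
coding = [(120, 300), (500, 800)]
print(noncoding_hits(hits, coding))
```

A hit that passes this filter is only a CNS candidate; it would still need to be collinear with the neighboring syntenic genes, as described above.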
Using the previously generated results
This example shows how comparing genomes at a variety of levels can be used to identify specific genomic patterns.
The syntenic data for maize and sorghum is available here.
The classical maize gene list contains ~460 of the most widely studied maize genes, with their locations in the current genome and links to MaizeGDB and to their syntenic regions in maize, sorghum, brachypodium, and rice.