Maize Sorghum Syntenic dotplot

From CoGepedia
Figure 1. Genomic evolutionary relationships between sorghum and maize; red stars indicate whole genome duplication events. The maize-sorghum (grass) lineage has a paleo-whole genome duplication. Subsequent to their divergence, maize had an additional whole genome duplication event.
Figure 2. Syntenic dotplot with Ks coloration of sorghum (x-axis) versus maize (y-axis). Genes are used for axis metrics; black lines separate chromosomes in each genome. Results can be regenerated at: http://genomevolution.org/r/bay. Red syntenic lines are from the maize-specific whole genome duplication event and are orthologous to sorghum. Purple syntenic lines are from the older pre-grass whole genome duplication event and are out-paralogs.
Histograms of synonymous mutation values between syntenic genes identified between maize and sorghum.
Figure 3. High-resolution sequence comparison of syntenic regions from maize and sorghum, including the pre-grass whole genome duplication event, the divergence of their lineages, and the maize-specific whole genome duplication event. These events create six syntenic regions, each shown in its own panel: four from maize (top two panels and bottom two panels) and two from sorghum (middle two panels). Results can be regenerated at: https://genomevolution.org/r/dfk4
Figure 4. High-resolution sequence comparison of a genomic region from sorghum and its two orthologous syntenic regions from maize. The sorghum region is the middle panel. Notice the strong pattern of collinear homologous genes between sorghum and both maize regions, while the maize regions only share two homologous genes with each other. This is due to the process of fractionation following the maize-specific whole genome duplication event. Results can be regenerated at: https://genomevolution.org/r/dfk5
Figure 5. High-resolution sequence comparison of a genomic region from sorghum and its two syntenic orthologous regions from maize. Blastn was used to identify regions of similar sequence in order to identify conserved non-coding sequences. These are DNA sequences that don't code for RNA or protein, are usually found near genes, and are conserved to a high degree over long periods of evolutionary time. Such conservation is indicative of selection working to preserve the original sequence, and hence of a function for these sequences. One likely function is to serve as transcription factor binding sites. Results can be regenerated at: https://genomevolution.org/r/dfk6


Announcement

For researchers coming from MaizeGDB, please see this page for detailed step-by-step instructions on using CoGe by way of MaizeGDB.

Background

In this example, CoGe will be used to perform whole genome syntenic analysis of the genomes of maize and sorghum in order to:

  1. Generate whole genome syntenic dotplots and identify syntenic regions and gene sets
  2. Color gene-pairs in syntenic regions according to the evolutionary distance and classify syntenic regions to specific evolutionary events
  3. Perform high-resolution analysis of syntenic regions to identify:
    1. patterns of fractionation
    2. conserved non-coding sequences

Genome evolution of maize and sorghum

When comparing the genomes of maize and sorghum, there are three genomic evolutionary events that need to be considered. Figure 1 shows these events, which are, in chronological order:

  1. a whole genome duplication event that is shared among all the grasses
  2. the divergence of the maize and sorghum lineages
  3. the maize lineage-specific whole genome duplication event

Each one of these events creates a copy of the genome, and these events can be seen in a syntenic dotplot between these genomes.

Whole genome analysis using syntenic dotplots

A whole genome syntenic dotplot takes two genomes and lays them out end-to-end along each axis. In Figure 2, the sorghum genome is on the x-axis, and the maize genome is on the y-axis. Each black vertical and horizontal line delineates a chromosome. Every gene in one genome is compared to every gene in the other, and a dot is drawn at the appropriate x-y coordinate if two genes are similar in sequence. Genes with similar DNA sequence are putative homologs. These results are then fed into an algorithm to find collinear series of genes. If two genomic regions are related to one another through common descent from the same ancestral genomic region, then they will maintain a collinear arrangement of genes from that ancestor. While genomes can change, genes can move to new genomic positions, and duplicate genes can be lost, this pattern of collinear gene arrangement remains discernible for long evolutionary time periods and can be used to infer that two genomic regions are related through common ancestry (synteny). When such collinear arrangements are detected in this syntenic dotplot, those dots are colored. We call pairs of genes in a collinear arrangement syntenic gene pairs, or syntelogs.
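To make the idea of a "collinear series of genes" concrete, here is a minimal Python sketch. It is not the algorithm SynMap runs; it is a simplified greedy chaining under stated assumptions: each putative homolog pair is given as (x, y) gene-order positions in the two genomes, and the function name and the `max_gap` and `min_size` parameters are hypothetical choices made for illustration.

```python
def collinear_chains(pairs, max_gap=3, min_size=3):
    """Greedily group (x, y) gene-order pairs into collinear runs.

    A pair extends a chain if it lies within max_gap genes of the chain's
    last pair on both axes and keeps the chain's direction (forward or
    inverted). Chains shorter than min_size are discarded as noise.
    """
    chains = []  # each chain: {"points": [(x, y), ...], "direction": 0, +1 or -1}
    for x, y in sorted(pairs):
        for chain in chains:
            last_x, last_y = chain["points"][-1]
            dx, dy = x - last_x, y - last_y
            if not (0 < dx <= max_gap and 0 < abs(dy) <= max_gap):
                continue
            step = 1 if dy > 0 else -1
            if chain["direction"] in (0, step):
                chain["points"].append((x, y))
                chain["direction"] = step
                break
        else:
            # No existing chain could be extended: start a new one.
            chains.append({"points": [(x, y)], "direction": 0})
    return [c["points"] for c in chains if len(c["points"]) >= min_size]
```

Isolated dots (a homolog pair with no collinear neighbors) never reach `min_size` and are dropped, which mirrors why only collinear dots are colored in the dotplot. Production tools typically use dynamic programming with scoring rather than a greedy pass like this one.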

Methods

  1. Use SynMap to generate whole genome syntenic dotplots

Relative dating of genomic events and syntenic relationships

Since the whole genome duplication and lineage divergence events happened at different times in the history and evolution of maize and sorghum's lineages, the gene-pairs derived from those events are also of different ages. One way to measure the relative age of a pair of related genes is by estimating their rates of synonymous mutations. Genes that are more closely related usually have fewer synonymous changes than genes that are more distantly related. The rate of synonymous change has been measured for each pair of syntelogs identified in the maize-sorghum syntenic dotplot, and the pairs are colored such that younger syntelogs (lower number of synonymous changes) are red, and older syntelogs (higher number of synonymous changes) are purple. Looking at the syntenic dotplot, it is now easy to identify red, younger syntenic regions and purple, older syntenic regions.
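The coloring step can be sketched as a simple split on the synonymous rate (Ks). This is an illustration only: the function name, the gene identifiers, the Ks values, and the cutoff below are all hypothetical toy inputs, not values taken from the maize-sorghum analysis. In practice a cutoff would be chosen by inspecting the Ks histogram for the gap between the younger and older peaks.

```python
def classify_by_ks(ks_by_pair, cutoff):
    """Split syntelog pairs into younger ("red") and older ("purple") groups.

    ks_by_pair maps a (gene_a, gene_b) tuple to its Ks estimate.
    cutoff is a user-chosen Ks value separating the two groups.
    """
    groups = {"red": [], "purple": []}
    for pair, ks in ks_by_pair.items():
        color = "red" if ks < cutoff else "purple"
        groups[color].append(pair)
    return groups


# Toy example with made-up identifiers and Ks values.
toy_pairs = {
    ("sorghum_gene_1", "maize_gene_1"): 0.15,
    ("sorghum_gene_1", "maize_gene_2"): 0.20,
    ("sorghum_gene_1", "maize_gene_3"): 0.90,
}
toy_groups = classify_by_ks(toy_pairs, cutoff=0.5)
```

The dotplot itself uses a continuous color scale rather than two bins; the two-way split here only captures the distinction between the red and purple lines discussed in the text.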

Looking closely at the syntenic dotplot, there is an overlap of these colored lines when the lines are projected to one axis or the other. This is because a given region of one genome is syntenic to multiple regions in the other genome. Based on the series of events listed above, it is expected that for every region of the sorghum genome, there will be two red lines in maize because maize has had a whole genome duplication event after these lineages diverged. On the other hand, for each region of the maize genome, there will only be one red line in sorghum.

Understanding the purple lines is a bit more complicated. These syntenic regions are derived from the older shared whole genome duplication event. As seen with the red lines, for a given region of sorghum, there are two purple lines that come from maize's most recent whole genome duplication, and for a given region of maize, there will be a single purple line in sorghum.

Altogether, this means that there is a 2:4 syntenic relationship between sorghum and maize. There are two regions in sorghum from the pre-grass whole genome duplication event, and there are four in maize from the pre-grass whole genome duplication event combined with the subsequent maize-specific whole genome duplication event. This means that for any genomic region in maize or sorghum, there are a total of 5 other syntenic regions. This makes it possible to compare 6 syntenic regions at once: 2 from sorghum and 4 from maize.
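The 2:4 arithmetic can be written out as a short calculation. The sketch assumes that each whole genome duplication doubles the number of copies of an ancestral region and that no copy has been lost entirely; the function name is hypothetical.

```python
def syntenic_copies(wgd_events):
    """Number of copies of one ancestral region after a number of WGD events."""
    return 2 ** wgd_events


sorghum_regions = syntenic_copies(1)  # pre-grass WGD only
maize_regions = syntenic_copies(2)    # pre-grass WGD plus maize-specific WGD
total_regions = sorghum_regions + maize_regions
other_regions = total_regions - 1     # regions syntenic to any one chosen region

print(f"{sorghum_regions}:{maize_regions} relationship, "
      f"{total_regions} regions in total, {other_regions} others per region")
```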

Methods

  1. Use SynMap to generate whole genome syntenic dotplots and use options to calculate synonymous rate data and color syntenic gene pairs accordingly


High-resolution analysis of syntenic regions using GEvo

Another way to see these patterns is through high-resolution analysis of syntenic regions using GEvo. When SynMap is used to create and visualize syntenic dotplots, the results are interactive and provide links to GEvo. Figure 3 shows an example 6-way comparison of syntenic regions from maize and sorghum dating back to the pre-grass whole genome duplication event. Each panel of the figure represents one genomic region. In this figure, the two sorghum regions derived from the pre-grass whole genome duplication event are the middle two panels, with two maize syntenic regions located above or below each sorghum region. Each pair of maize regions is derived from the maize-specific whole genome duplication event and is orthologous to the closest sorghum region (derived from the divergence of their lineages), while the two sorghum regions are paralogous (or homeologous) to each other (derived from the pre-grass whole genome duplication event).

Methods

  1. Using SynMap's results, select chromosome comparisons with syntenic regions of interest
    1. If you are identifying all maize regions that are syntenic to a region of sorghum chromosome one, open all maize-sorghum chromosome comparisons with synteny by clicking on the chromosome-chromosome comparisons
  2. When the zoomed-in SynMap chromosome dotplot appears, click on a gene-pair in a syntenic region of interest. The cross-hairs in SynMap will turn red when you are on a gene-pair.
  3. This will launch GEvo with that gene-pair pre-loaded and BlastZ selected as the alignment algorithm
    1. Note: BlastZ is good for finding large blocks of similar sequences
  4. Run the analysis
  5. Find additional syntenic genomic regions of interest using SynMap, note their genomic coordinates, then launch and run GEvo.
    1. For example, if the first analysis was at position 1,000,000 on chromosome 1 of sorghum, find that position in each of the other maize comparisons to sorghum chromosome 1
  6. Merge GEvo analyses together
    1. Note: When comparing genomes with lots of internal repeated sequences (such as transposons in maize), masking all sequences besides CDS sequences will make the GEvo analyses run much faster
  7. Adjust the extent of regions analyzed
  8. Reverse complement sequences as needed to make the identification of collinear regions easier

Fractionation of gene content following whole genome duplication events

In Figure 3, pairwise comparisons of these regions have been performed in order to identify similar protein coding DNA sequences. For several comparisons, colored lines have been drawn connecting regions of sequence similarity. It is apparent that these lines have a collinear arrangement, which is evidence that the regions are syntenic. However, notice how there are different densities of lines for different comparisons. Each sorghum region has lines drawn connecting it to its two orthologous maize regions, and to the other sorghum region. When comparing the pair of sorghum regions, not all of the genes are shared. This is due to a process known as fractionation. Following a whole genome duplication event, many duplicated genes are lost from one homeologous region or its partner region over evolutionary time.

Fractionation is also seen between pairs of maize regions derived from the maize-specific whole genome duplication event. Figure 4 shows a high-resolution analysis using GEvo of a sorghum region and its two syntenic orthologous regions from maize. While a given sorghum region has nearly its entire gene content represented in its two orthologous maize regions, some genes are represented only in one of the two regions. This is due to genes duplicated in the maize-specific whole genome duplication event being lost from one homeologous region or the other.
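The retention pattern described here can be tallied with a small Python sketch. It assumes you already have, for each of the two maize regions, the set of sorghum genes that have a syntelog in that region (for example, read off a GEvo comparison). The function name, argument names, and gene identifiers are hypothetical.

```python
def fractionation_summary(reference_genes, retained_in_a, retained_in_b):
    """Classify each reference (sorghum) gene by which maize homeolog retains it.

    reference_genes: ordered list of genes in the sorghum region.
    retained_in_a / retained_in_b: sorghum genes with a syntelog in
    maize region A / maize region B.
    """
    a, b = set(retained_in_a), set(retained_in_b)
    summary = {"both": [], "only_a": [], "only_b": [], "neither": []}
    for gene in reference_genes:
        if gene in a and gene in b:
            key = "both"
        elif gene in a:
            key = "only_a"
        elif gene in b:
            key = "only_b"
        else:
            key = "neither"
        summary[key].append(gene)
    return summary
```

Genes in the "only_a" and "only_b" groups are the fractionated ones: the duplicate was lost from one maize region but kept in the other. A small "both" group corresponds to maize regions that share few homologous genes with each other, as in Figure 4.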

Methods

Using the previously generated results

  1. Skip the bottom three sequences in the previous GEvo analysis by:
    1. Open the "sequence options" menus for sequences to be skipped
    2. Select the option to skip sequence
  2. Select sub-regions for further analysis
  3. Re-run the analysis


Identifying conserved non-coding sequences (CNSs)

Once syntenic gene sets are identified, these can be analyzed for conserved non-coding sequences (CNSs). CNSs are DNA sequences, usually neighboring genes, that do not code for protein and are nonetheless conserved over evolutionary time similarly to protein coding sequences. Because of their sequence conservation, CNSs are thought to have a biological function for which selection is preserving the DNA sequence. One example of such a function is serving as a transcription factor binding site.

Figure 5 shows a high-resolution analysis using GEvo to identify CNSs. In this figure, there are several regions of sequence similarity that are collinear with respect to neighboring genes and do not overlap protein coding regions (shown as green arrows in these panels). Interestingly, some CNSs are retained in both regions of maize.
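The "does not overlap protein coding regions" criterion can be sketched as an interval filter. This is only one part of CNS identification and is not GEvo's implementation: it assumes similarity hits and coding regions are given as inclusive (start, end) coordinates on the same sequence, and the function names are hypothetical. A real analysis would also apply alignment score thresholds and check collinearity with neighboring genes.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def noncoding_hits(hits, coding_regions):
    """Keep only similarity hits that overlap no protein coding region.

    hits and coding_regions are lists of inclusive (start, end) tuples
    on the same genomic sequence.
    """
    return [
        hit
        for hit in hits
        if not any(intervals_overlap(hit[0], hit[1], cds[0], cds[1])
                   for cds in coding_regions)
    ]
```

For example, with coding regions at (200, 500) and (1000, 1200), hits at (100, 150) and (900, 930) are kept as CNS candidates, while a hit at (480, 520) is dropped because it overlaps the first coding region.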

Methods

Using the previously generated results

  1. Select sub-regions for further analysis
  2. Change the alignment algorithm to blastn and use settings for finding CNSs
  3. Re-run the analysis

Concluding remarks

This example shows how comparing genomes at a variety of levels can be used to identify specific genomic patterns.

  1. Whole genome syntenic dotplots identify syntenic regions
  2. Coloring gene-pairs in syntenic regions according to their evolutionary distance classifies syntenic regions to specific evolutionary events
  3. High-resolution analysis of syntenic regions identifies:
    1. patterns of fractionation
    2. conserved non-coding sequences

Download Data

The syntenic data for maize and sorghum is available here.

The classical maize gene list contains ~460 of the most widely studied maize genes, with their locations in the current genome and links to MaizeGDB and to their syntenic regions in maize, sorghum, brachypodium, and rice.