Analysis of variations found in genomes of Escherichia coli strain K12 DH10B and strain B REL606 using SynMap and GEvo analysis

From CoGepedia
Revision as of 12:01, 18 November 2009 by Ambz (Talk | contribs)


In this exercise you will compare the genomes of two Escherichia coli strains, K12 DH10B and B REL606, using a whole-genome syntenic comparison followed by high-resolution analyses of specific genomic regions. These analyses use CoGe's tools SynMap and GEvo, respectively, and reveal evolutionary changes that occurred after the two lineages diverged. Although the nucleotide sequences of these genomes are identical over large expanses, you will discover many types of large-scale genomic change, including phage insertions, transposon transposition, and genomic insertion, deletion, inversion, and duplication events. The same computational tools can be used to compare the genomes of any organisms.

First, you are going to identify syntenic regions between these genomes. Synteny describes two or more genomic regions that share a common ancestry, that is, regions derived from the same ancestral genomic region. To find them, you will construct a syntenic dotplot of K12 DH10B and B REL606 using SynMap. Go to SynMap. Search for these E. coli strains in CoGe's database by typing part of their names in the "Name" search boxes for Organism 1 and Organism 2. For example, search for "DH10B" and "REL606" respectively, or type "escheri" in both boxes. Once CoGe has found organisms matching these names, make sure they are selected in the Organism List. Although several parameters can be configured when generating a syntenic dotplot with SynMap, the default settings work well for most situations, and very well for closely related organisms. Click "Generate SynMap" to start the analysis.

Webpage of SynMap

A lot of processing happens behind the scenes, but in general a syntenic dotplot is created in two steps. First, the genomes are compared to one another to find putative homologous genes between them. Second, these gene pairs are processed to find collinear series of genes in both genomes. The underlying principle is that the most likely and parsimonious explanation for two genomes sharing a collinear series of homologous genes is that those regions are derived from a common ancestral genomic region (hence they are syntenic).
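The second step, grouping homologous gene pairs into collinear series, can be illustrated with a minimal Python sketch. This is not the algorithm SynMap runs; it is a simplified greedy pass that assumes genes are identified by their order index in each genome, handles only the forward orientation, and uses hypothetical parameters `max_gap` and `min_len`.

```python
from typing import List, Tuple

# Each pair is (gene index in genome A, gene index in genome B).
Pair = Tuple[int, int]


def collinear_runs(pairs: List[Pair], max_gap: int = 2, min_len: int = 3) -> List[List[Pair]]:
    """Group gene pairs into runs whose order is conserved in both genomes.

    Simplified illustration: forward orientation only, greedy, one pass.
    """
    runs: List[List[Pair]] = []
    current: List[Pair] = []
    for a, b in sorted(pairs):
        if current:
            prev_a, prev_b = current[-1]
            # Extend the run only if both genomes advance by a small step.
            if 0 < a - prev_a <= max_gap and 0 < b - prev_b <= max_gap:
                current.append((a, b))
                continue
            # The run ended; keep it only if it is long enough to be convincing.
            if len(current) >= min_len:
                runs.append(current)
        current = [(a, b)]
    if len(current) >= min_len:
        runs.append(current)
    return runs


# Hypothetical gene pairs: two collinear blocks and one isolated hit.
example_pairs = [(1, 1), (2, 2), (3, 3), (4, 10), (5, 11), (6, 12), (7, 13), (9, 40)]
example_runs = collinear_runs(example_pairs)
```

Isolated pairs such as `(9, 40)` are dropped, which mirrors the distinction between putative homologs and pairs that are supported by their neighbors.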

When finished, SynMap will display a dotplot. Each axis of the dotplot is in nucleotide units and represents one of the two genomes laid end to end. The lower-left corner represents the start of each genome (usually 'ORI' for circular bacterial genomes), and the end of each axis is the end of each chromosome. Each putative homologous gene-pair is drawn as a gray dot whose position corresponds to the genomic position of each gene in its respective genome. Gene-pairs that have been identified as syntenic are colored green. Together these dots appear as a green line which, for the comparison of these two genomes, is nearly continuous along the entire length of both genomes.

Syntenic dotplot of Escherichia coli strain B REL606 and strain DH10B. The genomes are laid on the axes: REL606 (x-axis) and DH10B (y-axis). The numbers correspond to the individual analyses of the "breaks" in the dotplot, which can be found here: Syntenic dotplot
If you look closely at the syntenic dotplot of these genomes, you'll notice that the syntenic line is not perfect: there are many "breaks", or discontinuities, along it. In the figure above, each is indicated by a numbered arrow. These breaks in the syntenic path are due to genomic changes at a larger scale than a single nucleotide polymorphism, and are most likely due to the insertion or deletion of a chunk of many nucleotides in one of the genomes. To account for and characterize these discontinuities accurately, you need to perform a high-resolution analysis of these regions using GEvo. GEvo lets you run pairwise comparisons between multiple genomic regions and specify how large a region to analyze. More information on the GEvo software tool can be found at: GEvo.

To analyze each of these "breaks" in the syntenic dotplot, first click on the dotplot. This opens a new window with a close-up of the dotplot. In this window, use the cross-hair locator that appears when you mouse over the dotplot and place it on the green dot right before a "break". The locator turns red when it is over a gene-pair that can be used as a link to GEvo. When the locator has turned red, click. For example, to view the GEvo analysis of "break" number six, position the locator where the number six arrow points on the dotplot. Click when the locator turns red, or use this address: tinyurl.com/yl4vlbb to regenerate this particular analysis.
Red cross positioned at the sixth break. Clicking here opens a new GEvo page containing sequence information for both strains at this locus. Click "Run GEvo Analysis!" to visualize the syntenic genes at this location.
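The "breaks" described above can also be thought of numerically: along an unbroken syntenic line the offset between the two genome coordinates stays constant, and a break is a place where that offset jumps. The following Python sketch illustrates the idea on made-up coordinates. The function name, the `min_jump` threshold, and the example points are assumptions for illustration and are not part of SynMap; inversions are not handled.

```python
from typing import List, Tuple


def find_breaks(points: List[Tuple[int, int]], min_jump: int = 5000) -> List[Tuple[int, int, int]]:
    """Report places where the diagonal offset (y - x) of syntenic points jumps.

    points: (x, y) nucleotide positions of syntenic gene pairs,
            x in the x-axis genome and y in the y-axis genome.
    Returns (x_before, x_after, jump) for each break. A positive jump means
    the y-axis genome has extra sequence between the two points relative to
    the x-axis genome; a negative jump means the opposite.
    """
    pts = sorted(points)
    breaks = []
    for (x1, y1), (x2, y2) in zip(pts, pts[1:]):
        jump = (y2 - x2) - (y1 - x1)
        if abs(jump) >= min_jump:
            breaks.append((x1, x2, jump))
    return breaks


# Hypothetical syntenic points: 15 kb extra on the y-axis genome,
# then 25 kb extra on the x-axis genome.
example_points = [(1000, 1000), (2000, 2000), (3000, 18000), (4000, 19000), (30000, 20000)]
example_breaks = find_breaks(example_points)
```

The sign of the jump says which genome carries the extra sequence, but not whether the event was an insertion in one genome or a deletion in the other; that distinction is what the GEvo analysis is for.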

After you position the locator and click, a new GEvo page will appear displaying the sequence information for the region of interest in the dotplot. These sequences are "anchor" points in the two genomes that specify the genomic regions to be compared. When linking to GEvo, SynMap automatically sets GEvo to use 50,000 nucleotides to the left and right of each anchor point. By default, GEvo uses BlastZ as its sequence comparison algorithm, which is a good choice for identifying large blocks of similar sequence. These settings (~100 kb of each genome; BlastZ) usually work well for an initial analysis, so all you need to do is click "Run GEvo Analysis!".
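The arithmetic behind the ~100 kb window is simple and can be sketched as follows. This is only an illustration of the coordinates involved: the function name is hypothetical, and clamping the window at the ends of the sequence is an assumption, not a documented description of how GEvo treats anchors near the start or end of a chromosome.

```python
from typing import Optional, Tuple


def anchor_window(anchor: int, flank: int = 50_000,
                  genome_length: Optional[int] = None) -> Tuple[int, int]:
    """Return 1-based (start, end) of the region centered on an anchor position.

    Assumption: the window is clamped at the sequence ends rather than
    wrapped around a circular chromosome.
    """
    start = max(1, anchor - flank)
    end = anchor + flank
    if genome_length is not None:
        end = min(genome_length, end)
    return start, end


# An anchor in the middle of a genome gives a window of about 100 kb.
start, end = anchor_window(1_200_000)
window_size = end - start  # 100,000 nucleotides
```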

Webpage of GEvo

Once the GEvo results appear, we can begin to look for and characterize the differences between these two genomes at this syntenic region. GEvo's results show two panels, one for each genomic region. The dashed line in the middle of each panel separates the top and bottom strands of DNA. Gene models are drawn as green arrows above and below this line, and clicking on a gene causes its annotation to appear in a box.


To determine the identity of a gene, click on it and its annotation will appear in a box.

The pink blocks in these panels are genomic regions identified by BlastZ as being similar in sequence. If you click on a pink block, a transparent wedge is drawn connecting it to its partner region in the other genome, and information about the blast hit (also known as an HSP) is shown in an information box. Click on all the pink blocks displayed to connect every region of sequence similarity. Since we are analyzing "breaks" in our dotplot, we expect to see insertion(s) and/or deletion(s) in the genomes. These indels are evident from the pattern of transparent wedges and missing pink blocks.
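The way indels show up as "missing pink blocks" can be sketched in a few lines of Python: a stretch of one region that is not covered by any HSP, while the other region has no matching uncovered stretch, is sequence present in only one genome. The HSP tuples, the `min_size` threshold, and the coordinates below are hypothetical and are not taken from an actual GEvo result.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) in region coordinates


def uncovered(intervals: List[Interval], region_start: int, region_end: int,
              min_size: int = 1000) -> List[Interval]:
    """Return stretches of a region that no HSP covers, ignoring small gaps."""
    gaps = []
    pos = region_start
    for start, end in sorted(intervals):
        if start - pos >= min_size:
            gaps.append((pos, start))
        pos = max(pos, end)
    if region_end - pos >= min_size:
        gaps.append((pos, region_end))
    return gaps


# Hypothetical HSPs as (a_start, a_end, b_start, b_end).
hsps = [(0, 40_000, 0, 40_000), (55_000, 100_000, 40_500, 85_500)]

gaps_a = uncovered([(h[0], h[1]) for h in hsps], 0, 100_000)
gaps_b = uncovered([(h[2], h[3]) for h in hsps], 0, 85_500)
# gaps_a holds a 15 kb stretch with no partner in region B, while gaps_b is
# empty: region A carries extra sequence (or region B has lost it).
```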

Let us look at an example of a deletion in DH10B with no apparent change in the corresponding region of the REL606 genome. This corresponds to the GEvo analysis of the first break in our dotplot. You can regenerate the high-resolution analysis here [1]. Click on all the pink blocks to connect every region of sequence similarity. Note that the deletion in DH10B creates a gap between the transparent wedges. Click on individual genes in REL606 to identify which genes are missing in DH10B. We interpret this as a deletion because pseudogenes are present at the site in DH10B. Perhaps transposon insertion(s) created pseudogenes that were later deleted. Remember that we are looking at these genomes at a single time point and trying to trace back their history. On that basis, we can hypothesize that at some point in the past, transposon(s) integrated into what is now a deletion site in DH10B. As we will see, transposition is the most common cause of the changes found in these genomes, so it is most likely that this deletion in DH10B resulted from transposon insertion.
DNA composition of REL606 (top) and DH10B (bottom) at the first break. Every syntenic region is connected by a transparent wedge projecting from the pink blocks. The genes shown in red towards the left on DH10B are pseudogenes and support the argument that this is a deletion site. The pseudogenes were created by insertion of transposon(s). The transposon(s) "hopped" to another location while the pseudogenes were deleted.
When you click on the dotplot, GEvo displays 50,000 nucleotides to the left and right of the anchor position. To view genes at a higher resolution, use the side bars to restrict the region of the genome being displayed. This magnifies the region of interest and lets you see minute details. It is particularly helpful when looking for direct repeats to account for insertions.
The side bars magnify the region. This is the GEvo analysis of the third break. Notice that the inserted region in REL606 is bordered by direct repeats. The pink bars below the genes connect to a common syntenic region (some repeats) in DH10B.
Insertions in a bacterial genome can come from exogenous DNA such as plasmids and phages, or can result from transposition or translocation. Let us consider the third break in our dotplot. Run the GEvo analysis. Notice that at the edges of the "inserted" genes in REL606, the pink wedges overlap at a common syntenic region in DH10B. These genes are apparently missing from DH10B at this particular locus. Notice that the information boxes that appear with the connectors show almost the same percent identity (~99%) for both ends. These are direct repeats.
Higher-resolution analysis of the direct repeats. There was a staggered cut in the DNA of REL606 followed by insertion of ~15 genes, which created direct repeats. Click on the pink blocks above the repeats and their percent identity to the syntenic region in DH10B will be displayed in a box. Notice that both regions in REL606 have the same percent identity to the syntenic region in DH10B.
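To make the idea of direct repeats flanking an inserted region concrete, here is a small Python sketch that looks for short sequences shared, in the same orientation, between a window on the left edge and a window on the right edge of a candidate insert. It uses exact k-mer matching on a toy sequence, which is far cruder than the BlastZ comparison GEvo performs; the function name, the window coordinates, and the sequences are all made up for illustration.

```python
from typing import List, Tuple


def shared_direct_repeats(seq: str, left_window: Tuple[int, int],
                          right_window: Tuple[int, int], k: int = 8) -> List[Tuple[int, int, str]]:
    """Find k-mers present in both windows in the same orientation.

    Windows are half-open (start, end) coordinates on `seq`.
    Returns (left_position, right_position, kmer) for each exact match.
    If a k-mer occurs more than once in the left window, only its last
    position is reported.
    """
    l_start, l_end = left_window
    r_start, r_end = right_window
    left_kmers = {seq[i:i + k]: i for i in range(l_start, l_end - k + 1)}
    hits = []
    for j in range(r_start, r_end - k + 1):
        kmer = seq[j:j + k]
        if kmer in left_kmers:
            hits.append((left_kmers[kmer], j, kmer))
    return hits


# Toy sequence: flank + repeat + insert + repeat + flank.
repeat = "GATTACAGG"
toy = "TTGCA" + repeat + "CCCCCCCCCCCC" + repeat + "ATCGT"
hits = shared_direct_repeats(toy, (0, 14), (26, 40), k=9)
```

A match found in the same orientation on both edges is what the overlapping wedges in the GEvo panel indicate; a repeat in the opposite orientation would instead match the reverse complement.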

Click on the inserted genes to view their annotations. They include the lac operon and a few other metabolic genes. But how can a critical operon be missing from E. coli DH10B? If you go back to the dotplot and align the locator on the third break, you will see a fragment of green line right above the break. Position the locator on it and click. The GEvo analysis will show that the region of "inserted" genes in REL606 is syntenic to a region at a more distant locus in DH10B. You will see that the lac operon and the other metabolic genes that are missing in DH10B at the third break are present elsewhere in its genome. This is an example of translocation. In this case the syntenic regions are not collinear, i.e. they do not share the same locus on the chromosome.

Let us consider the ninth break in the dotplot. Run the GEvo analysis for this break. Magnify the region from 50k to 100k using the side bars. Notice that a number of flagella genes are present in DH10B but missing in REL606. The presence of direct repeats is not yet evident, so we cannot conclude that an insertion happened in DH10B. We will BLAST DH10B against itself to look for direct repeats. If this is in fact an insertion, we expect to see color-coded bars bordering the DNA region containing these flagella genes in DH10B. Click on "Add sequence" and copy and paste the gene ID of DH10B into the "name" box of the third sequence. This option allows us to view multiple sequences at the same time. Adjust the number of nucleotides shown to the left and right of each sequence so that the syntenic regions are visible in all three sequences.
The extent of each sequence can be adjusted manually by adding or removing nucleotides on the left and right. This GEvo page corresponds to the analysis of the ninth break, where we want to find direct repeats in DH10B by BLASTing it against itself.
Click "Run GEvo Analysis!". You will see various small color-coded blocks beneath the inserted genes in DH10B. Click on the blocks bordering the inserted segment. The individual blocks will interconnect and create a criss-cross pattern of transparent wedges. Notice that the information box shows 100% identity between the two, meaning that the repeats bordering the inserted region are exactly the same.
Example of comparing multiple sequences simultaneously to look for direct repeats. Notice the criss-cross pattern that the wedges create. Direct repeats provide evidence that an insertion happened in DH10B.

These are direct repeats and are evidence for an insertion in DH10B. Direct repeats allow us to infer that there was a staggered cut in the DNA of DH10B followed by insertion of a linear piece of DNA. The gaps at the ends were repaired and ligated, and these repaired gaps became the direct repeats that flank the DNA segment containing the flagellar genes. Notice that there is not much support for this being a translocation. You can click on several of the green dots that align with the ninth break on the dotplot; these regions are mostly repeats that match all over the genome and are therefore "noise" in our data.
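The staggered-cut model described above explains why the flanking repeats are identical: the short single-stranded stretch exposed on each side of the cut is filled in during repair, so the same target sequence ends up on both sides of the insert. The Python sketch below models this on a toy sequence. The function name, the cut position, and the overhang length are illustrative assumptions and do not describe the actual event in DH10B.

```python
def insert_with_staggered_cut(target: str, cut_pos: int, overhang: int, insert: str) -> str:
    """Model insertion of `insert` at a staggered cut in `target`.

    The `overhang` bases starting at `cut_pos` are single-stranded after the
    cut; gap repair copies them on both sides, so they flank the insert as
    direct repeats.
    """
    site = target[cut_pos:cut_pos + overhang]
    return target[:cut_pos] + site + insert + site + target[cut_pos + overhang:]


# Toy example: a 5-base staggered cut and an 8-base insert.
target = "AAACCGTTTGGG"
insert = "ATGCATGC"
product = insert_with_staggered_cut(target, cut_pos=3, overhang=5, insert=insert)

# The two copies of the target site flank the insert and are identical.
left_repeat = product[3:8]
right_repeat = product[16:21]
```

Because both copies derive from the same template, they start out 100% identical, which is consistent with the identity reported for the repeats at the ninth break.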

Next, we will look at a phage insertion. Consider the second break in the dotplot. Run the GEvo analysis on this break. Before checking for direct repeats, determine the identity of the genes inserted in DH10B. These are phage-specific genes: the CP4-6 prophage has integrated its DNA at this locus, as shown by the CP4-6-specific integrase, DNA-binding protein, and so on.
In DH10B, the inserted genes are CP4-6 prophage-specific, as shown by their gene annotations.

Next we will look at an example of an inversion. Consider the twelfth break on the dotplot. Run the GEvo analysis on this break. Notice the syntenic regions between 10K and 30K on DH10B. Click on the pink blocks and notice the pattern of transparent wedges that connect the syntenic regions. These genes are inverted. Notice IS10R and IS10L bordering the inverted region. These IS elements form inverted terminal repeats (ITRs), which tend to invert, or flip, the genes between them.

In DH10B, the insertion of IS10R and IS10L has created inverted terminal repeats, i.e. the two IS10 transposons (IS10R and IS10L) are integrated in opposite orientations. A crossover between these two transposons has inverted the DNA segment (three genes) between them. Notice the pattern of wedges connecting the syntenic regions: they cross each other. Use the side bars to better visualize this inversion event.
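The crossing wedges follow directly from what an inversion does to gene order: the genes between the two IS10 copies appear in reverse order and on the opposite strand. The Python sketch below models this with a toy gene list. IS10L and IS10R are named in the text above, but geneA, geneB, and geneC are placeholders, and the strand assignments are assumptions chosen only to show the two IS10 copies in opposite orientations.

```python
from typing import List, Tuple

Gene = Tuple[str, str]  # (name, strand)


def invert_genes(genes: List[Gene], first: int, last: int) -> List[Gene]:
    """Invert genes[first..last]: reverse their order and flip their strands."""
    flipped = [(name, "-" if strand == "+" else "+")
               for name, strand in reversed(genes[first:last + 1])]
    return genes[:first] + flipped + genes[last + 1:]


def crossing_wedges(order_a: List[str], order_b: List[str]) -> List[Tuple[str, str]]:
    """List gene pairs whose relative order differs between the two regions.

    Each such pair corresponds to two wedges that cross when the regions
    are drawn one above the other.
    """
    pos_b = {name: i for i, name in enumerate(order_b)}
    crossings = []
    for i in range(len(order_a)):
        for j in range(i + 1, len(order_a)):
            if pos_b[order_a[i]] > pos_b[order_a[j]]:
                crossings.append((order_a[i], order_a[j]))
    return crossings


# Placeholder genes between two IS10 copies in opposite orientations.
ancestral = [("IS10L", "+"), ("geneA", "+"), ("geneB", "+"), ("geneC", "+"), ("IS10R", "-")]
inverted = invert_genes(ancestral, 1, 3)
crossed = crossing_wedges([g[0] for g in ancestral], [g[0] for g in inverted])
```

With three genes inverted, every pair among them changes relative order, so all three pairs of wedges cross, while the wedges for the flanking elements do not.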

Detailed analysis of each break can be found here: Syntenic dotplot