From CoGepedia
Revision as of 18:04, 20 June 2010 by Elyons (Talk | contribs) (Handling lists of Genomic features: FeatList)

Bioinformatics Workshop on CoGe for the Society for Invertebrate Pathology


A hands-on Bioinformatics Workshop led by Dr. Eric Lyons, University of California, Berkeley, USA, will be held immediately prior to the SIP annual meeting.

8000 Genomes At Your Fingertips:

Comparative Genomics using CoGe

9 am – 4 pm Sunday, July 11, 2010

Currently, there are genomes available for over 8000 organisms, with new ones being sequenced at an exponential rate. New DNA sequencing technologies make obtaining genome sequences easier and faster, and modern researchers must understand genome biology in order to take advantage of this wealth of data. CoGe, a web-based comparative genomics system, permits researchers anywhere in the world to easily analyze and compare genomes. CoGe maintains a repository of all publicly available genomes, and its suite of interconnected analytical and visualization tools allows researchers to rapidly understand genome structure and evolution from the level of a gene to the genome. This workshop, led by CoGe’s creator, Dr. Eric Lyons of the Department of Plant and Microbial Biology at UC Berkeley, USA, provides an introductory description of the structure, evolution, and dynamics of genomes, followed by a hands-on in silico training session using CoGe to characterize the structure and evolution of genomes. Workshop participants are encouraged to propose their invertebrate pathogens of interest for exploratory de novo research.

Specific topics will include:

  1. Producing an overview of a genome including GC DNA content bias, amino acid usage, codon usage
  2. Whole genome comparisons using syntenic dotplots
  3. Visualizing the differences in genome structure and evolution among the major domains of life
  4. Identifying ancient whole genome duplications, and regional inversions and rearrangements
  5. Gene duplications, translocations, deletions, and conserved non-coding sequences



  1. Name, research area of interest, organisms of interest

Background on CoGe

Understanding how CoGe is put together will help you understand how best to use it. Its design and architecture allow you to create your own analysis workflows that are guided by your research questions and not necessarily by what the tools can do. There are several dozen tools in CoGe, each designed to perform one task. These tools are linked together so that the results from one analysis flow seamlessly into the next. At each point there is access to the underlying data and raw analyses.

  1. Multi-genome support: concurrent storage of any version of any genome from any organisms in any state of assembly
  2. Interconnected web-based tools:
    1. Accessible from anywhere in the world
    2. Let our servers do the heavy lifting
    3. Web-links create automatic ways to save an analysis for future research, collaboration, and publication
    4. Creates open-ended analyses driven by research questions
    5. Permits easy linking in other web-based tools to and from CoGe
  3. Importance of visualization and interactions

Get to know your genome: OrganismView

This tool lets you search for a genome of interest and get information about it.

  1. Searching for and finding a genome of interest
  2. Overview of data contained within a genome
  3. Getting annotated Genomic features
  4. Generating histograms of CDS GC content
    1. Identifying high/low GC protein coding sequence
  5. Generating codon usage charts
  6. Generating amino acid usage charts
  7. Exporting Data
    1. Whole genome sequence
    2. Sequence from a chromosome/contig
    3. Annotations
  8. Searching for related organisms by taxonomic description
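
The GC content, wobble GC content, and codon usage summaries listed above can be illustrated with a small sketch. This is not CoGe's implementation; the function names and the toy sequence are assumptions made purely for illustration, and the sketch assumes the CDS is given in frame as a plain nucleotide string.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def codons(cds):
    """Split an in-frame CDS into complete codons, dropping a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def wobble_gc_content(cds):
    """GC fraction counted only at the third (wobble) position of each codon."""
    third_positions = "".join(codon[2] for codon in codons(cds))
    return gc_content(third_positions)


def codon_usage(cds):
    """Count how many times each codon occurs in the CDS."""
    return Counter(codons(cds))


# Hypothetical toy CDS, not taken from any genome in CoGe.
example_cds = "ATGGCCGGCTAA"
print(gc_content(example_cds))
print(wobble_gc_content(example_cds))
print(codon_usage(example_cds).most_common())
```

Computing these values for every CDS in a genome and binning them would give the kind of histogram that OrganismView draws.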

Browsing a genome: GenomeView

This tool is CoGe's dynamic tool for visualizing genomes. The graphics used in it to represent genomic data are also used by other tools analyzing genomic data. Understanding these graphics is important for understanding other analyses.

  1. Understanding the graphics
  2. Understanding the interface
  3. Understanding the GC content filter
  4. Getting annotations
  5. Extracting sequence
  6. Extracting Genomic features

Handling lists of Genomic features: FeatList

This tool is the central hub of CoGe for dealing with lists of genomic features. It is often used by other tools to handle a list of features, and it is itself used to send features to other tools for analysis.

  1. Changing viewable columns of data
  2. Getting annotations
  3. Getting codon and amino acid usage tables
  4. Exporting data:
  5. Linking to other tools
    1. FastaView

Searching for homologs: CoGeBlast

CoGe's interface to blast algorithms permits the selection of any set of genomes of interest. Its results are presented in a highly graphical and interactive way to make the evaluation of blast hits easy. Because CoGeBlast is linked to CoGe's genomes database, blast hits can be used to identify annotated genomic features that can subsequently be exported to other tools in CoGe. Overall, this tool makes it quick and easy to find homologs in any set of genomes for downstream analyses.

  1. Homolog, paralog, ortholog, in-paralog, out-paralog, xenolog -- what's the difference?
  2. Selecting genomes to search
  3. Choosing a blast algorithm
  4. Interpreting results
    1. Genomic overview of hits
    2. Table of hits
    3. Visualization of hits in context of the query sequence and target genomic region (what is a good homolog?)
    4. Blast hit details
    5. Selecting and sending hit genomic sequences to other tools

Example Analysis 1: Building a phylogeny

A key part of comparative genomics is first understanding the evolutionary relationships among a set of organisms. Phylogenetic trees are a great map to have when trying to unravel the evolutionary history of a set of genomes. Often, such exploratory analyses begin with closely related genomes and then work out to more distantly related genomes. This exercise uses CoGe's tools to build a list of homologous genes, and then sends those gene sequences to a phylogenetics website for phylogenetic tree reconstruction. The example organism is Autographa californica nucleopolyhedrovirus (AcMNPV).

  1. Get a list of all protein coding features in Autographa californica (quick link)
    OrganismView Features.png
    1. Search for and find the genome of Autographa californica nucleopolyhedrovirus in OrganismView
    2. Under "Genome Information" click on "Click for Features". This will display a table of all the genomic features near the bottom of the screen, below "Chromosome information".
    3. In this feature list ("Features for . . . Autographa californica. . ."), click on "Feature List?" next to the entry for CDS. This will open FeatList with all the protein coding sequences for Autographa loaded.
  2. Find a gene of interest: polyhedrin (quick link)
    1. Fetch and display all the annotations for these features by clicking "Get all" located under the table column header "Annotation"
    2. Filter the features to find the one that contains the word "polyhedrin" in its annotation by typing the word "polyhedrin" in the box next to "Filter Table Rows" located above the table.
    3. Check the box next to that feature
    4. Send the feature to CoGeBlast by selecting CoGeBlast in the drop down menu next to "Send Checked Features to:" and pressing the "Go" button. This menu is located below the feature list.
  3. Searching for homologs of polyhedrin using CoGeBlast (quick link)
    Polyhedrin blast hit.png
    1. When CoGeBlast loads, the sequence sent from FeatList will automatically be added to the query sequence submission box located near the bottom of the screen ("Sequence (fasta format)")
    2. Search for baculoviruses by typing "baculo" into the text box next to "Organism Description:"
    3. Add all baculovirus genomes to the "Genomes to BLAST:" list by pressing "Add all listed" located below the organism search list.
    4. Since polyhedrin genes may rapidly evolve (and viruses evolve rapidly in general), you will want to do a protein blast search. Change from a nucleotide search to a protein search by selecting the button next to "Protein Sequence" located to the left of the query sequence. When protein sequence is selected, the sequence loaded from FeatList will automatically be translated into protein sequence
    5. Run the blast search by pressing the button "Run CoGeBlast" located near the top of the screen
    6. Extra: Compare the blast results using blastn and blastp. You will see that several genomes do not have matches using blastn that do have matches using blastp (e.g., Helicoverpa armigera granulovirus)
    7. Evaluate the blast hits to identify those with a large amount of query sequence coverage. All genomes matched will have a single blast hit that overlaps nearly the entire query sequence. Genomic features in the target genomes overlapping the blast hits are putative orthologs to Autographa californica's polyhedrin gene. The only exception is a second blast hit found in the genome of Agrotis segetum nucleopolyhedrovirus. This hit covers more than half the query sequence but does not overlap an annotated genomic feature in the Agrotis segetum nucleopolyhedrovirus genome.
    8. Select all the blast hits by pressing "Select All" located under the "HSP Table", and then deselect the second blast hit for Agrotis segetum nucleopolyhedrovirus
    9. Select "Phylogenetics" from the drop down menu located next to "Send Checked Features to:" found below the "HSP Table", and press "Go". This will send the genomic features in the target genomes overlapping blast hits to FastaView
  4. Using FastaView to send sequences to the phylogenetics website (quick link)
    1. When FastaView loads, the nucleotide sequences of the features selected in CoGeBlast will be displayed in Fasta format
    2. Convert these sequences to protein sequences by pressing the button "Protein Sequences" located below the sequence display box
    3. Send these sequences to the phylogenetics website by pressing the button provided for it
  5. Simple phylogenetic analysis at the phylogenetics website
    1. The phylogenetics website contains a tool set for performing various manipulations of sequences, algorithms, and visualizations for phylogenetic tree reconstructions. You can find information about all of its offerings in the site's documentation. In addition, it contains a pipeline for "one click phylogenetics" which will run a series of programs to go from a set of fasta sequences to an image of a phylogenetic tree using a very good set of algorithms and options. Its pipeline consists of:
      1. Sequence Alignment: MUSCLE
      2. Alignment Refinement: Gblocks
      3. Phylogenetic tree reconstruction: PhyML
      4. Tree rendering and visualization: TreeDyn
    2. When sequences are sent from FastaView to the phylogenetics website, CoGe automatically sends them to its "one click" phylogenetics mode and clicks the button to start the analysis. All you need to do is wait.
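
The conversion from nucleotide to protein sequences performed in FastaView (and in CoGeBlast when "Protein Sequence" is selected) can be sketched as follows. This is a minimal illustration, not CoGe's code: it assumes the standard genetic code, in-frame CDS sequences, and a simple FASTA parser, and the record names and sequences are made up.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def translate(cds):
    """Translate an in-frame CDS; unknown codons become 'X', trailing stops are dropped."""
    cds = cds.upper().replace("U", "T")
    protein = [
        CODON_TABLE.get(cds[i:i + 3], "X")
        for i in range(0, len(cds) - len(cds) % 3, 3)
    ]
    return "".join(protein).rstrip("*")


# Hypothetical two-record FASTA input (toy sequences, not real polyhedrin genes).
fasta_text = """>gene_A
ATGCCGGATTAA
>gene_B
ATGCCNGATTAG
"""
for header, sequence in parse_fasta(fasta_text).items():
    print(">" + header)
    print(translate(sequence))
```

Protein sequences are preferred here for the same reason given for the blast search: they remain alignable after the underlying nucleotide sequences have diverged.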

Comparing Genomic Regions: GEvo

GEvo is CoGe's tool for dynamically comparing multiple genomic regions. It allows you to specify genomic regions by using a gene name, fetching a sequence from NCBI using a GenBank accession, or pasting in your own sequence. The extent of the genomic regions is up to you to define, but you can easily expand or contract the amount of sequence analyzed, reverse complement a sequence, and dynamically mask portions of the sequence based on annotated genomic features (e.g. non-CDS regions). There are a handful of sequence comparison algorithms to choose from and a variety of ways to manipulate the genomic regions.

  • Example region: using orthologous syntenic regions from Saccharomyces cerevisiae and Zygosaccharomyces rouxii
  • Example region: using homeologous syntenic regions from Arabidopsis thaliana
  • Example region: using orthologous syntenic regions from Wolbachia endosymbiont of Drosophila melanogaster strain wMel and Wolbachia endosymbiont strain TRS of Brugia malayi
  1. What is Synteny?
  2. How to identify syntenic regions using collinear gene arrangements.
  3. Sequence submission boxes
    1. Specifying a genomic region by gene name
    2. Fetching a sequence from NCBI
    3. Directly pasting in a sequence
  4. Adding in a new genomic region to the analysis
  5. Sequence submission box options
    1. Changing the extent of a region analyzed
    2. Reverse complementing a sequence
    3. Dynamically masking a sequence with annotated genomic features
    4. Skipping a sequence
  6. Changing the genomic extent of all regions at once
  7. Algorithms:
    1. blastn good for identifying small regions of sequence similarity
    2. blastz good for identifying large regions of sequence similarity
  8. Manipulating the image:
    1. Changing the size of the image
    2. Coloring background GC content
    3. Coloring background masked and unsequenced content
    4. Coloring CDS by wobble GC content
    5. Auto-adjust overlapping HSPs and genomic features
    6. Coloring features overlapped by HSPs
    7. Coloring matches in HSPs
  9. Results:
    1. Identifying regions of sequence similarity
    2. Connecting regions of sequence similarity
    3. Getting information on an HSP
    4. Getting information about a genomic feature
    5. Changing the extent of the region analyzed
    6. Exporting the selected sequence to SeqView
    7. Viewing the genomic region in GenomeView
    8. Saving the analysis for future work (saving a link to regenerate the analysis)
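
Two of the sequence submission options above, reverse complementing a sequence and masking a sequence by annotated features, are simple string operations. The sketch below illustrates both under stated assumptions: feature coordinates are 1-based and inclusive, masked bases are replaced with "N", and the function names are hypothetical rather than part of GEvo.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mask_outside(seq, keep_regions, mask_char="N"):
    """Mask every base that is not inside one of keep_regions.

    keep_regions: list of (start, end) tuples, 1-based and inclusive,
    e.g. the coordinates of annotated CDS features (masking non-CDS sequence).
    """
    masked = [mask_char] * len(seq)
    for start, end in keep_regions:
        # Clamp to the sequence so out-of-range coordinates are ignored.
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            masked[i] = seq[i]
    return "".join(masked)


# Hypothetical 12-base region with one annotated feature at positions 4-6.
region = "AAACCCGGGTTT"
print(reverse_complement(region))
print(mask_outside(region, [(4, 6)]))
```

Masking non-CDS sequence before a comparison restricts reported similarity to coding regions, which can make a crowded result easier to read.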

Example Analysis 2: Comparing genomic regions of the baculovirus most closely related to AcMNPV

In this example, we'll be using the results from Example 1 to extend our analysis of AcMNPV. Previously we identified homologs to the polyhedrin gene of AcMNPV and used them to build a phylogenetic tree of their evolutionary relationships. We will use this gene tree as a proxy for the species tree of these viruses in order to identify the baculovirus most closely related to AcMNPV.

  1. Previously, a list of putative homologs to AcMNPV's polyhedrin gene was identified in CoGeBlast. These were selected and sent to FastaView in order to generate their protein sequences and send them to the phylogenetics website. There are several other programs in CoGe to which identified genomic features may be sent. Select "Feature List" and press the "Go" button located next to the option. You can save the URL for this page in order to regenerate this feature list in the future.
  2. Using this feature list (or the original list in CoGeBlast), select the polyhedrin gene in the strain that is evolutionarily closest. To figure this out, use the phylogeny generated at the phylogenetics website (image). Note: You can use the "Filter Table Rows" text box to quickly find features, but remember that it is case-sensitive.
  3. The organism that is closest to AcMNPV is Plutella xylostella multiple nucleopolyhedrovirus (PxMNPV). When the two polyhedrin genes from these organisms are selected, select "GEvo" from the drop-down menu located below the list next to "Send Checked Features to:" and press "Go".
  4. GEvo will load with these two features automatically entered and selected in two sequence submission boxes. (quick link)
  5. GEvo's default options are usually a good place to start. Without changing any of the parameters, press "Run GEvo Analysis!" and run the analysis.
  6. GEvo's results show two panels, one for each of the genomic regions analyzed. (quick link)
    1. The graphics show genes as green arrows located above and below a dashed line (top and bottom strand respectively). Note that one gene in each region is colored yellow. These are the genes used to anchor these genomic regions (polyhedrin). To check that, click on the gene to get a dialog box with its annotations. These dialog boxes can be moved around the screen by clicking and dragging on the header, and closed by pressing the "X" located in the upper right corner. The annotation information contains links to get more data about a gene such as its sequence.
    2. Note the pink boxes above and below the genes. These represent regions of identified sequence similarity in the same or opposite orientation, respectively. Click on a large pink box to draw a transparent wedge connecting the boxes between the two regions. This also causes a dialog to open (or changes the information if the dialog box is already open) with an overview of information about that HSP (aka blast hit). To get the full information about that HSP, including an alignment, press the "full annotation" link at the top of the dialog box. The large HSP shows that these regions are 99.27% identical.
    3. To draw more than one connecting wedge at the same time, click and drag on a panel to draw a box; all HSPs within that box will have wedges drawn. If you want to draw them all, press and hold the shift key while clicking on a pink box.
  7. Draw wedges for all regions of sequence similarity by pressing and holding the shift key while clicking on a box.
    1. Note that there are several overlapping wedges drawn near the center. In this region there are boxes outlined in blue and black. These glyphs denote repeat regions. If you click on one, you'll see its annotation, which is a potential enhancer and origin of replication consisting of 30bp of imperfect palindromic repeats.
  8. GEvo-AcMNPV-vs-PxMNPV-repeat-region.png
    To examine the center repeat in more detail, drag the slider bars located at the ends of each genomic region's panel towards the repeat. Leave about a gene of sequence on either side of the slider bars. This will automatically adjust the amount of left and right sequence for each region in the sequence submission box. In addition, change the display options so that overlapping HSPs are automatically adjusted and color matches in HSPs so sequence differences are more easily visualized. These options are found under the "Result Parameters" tab located below the "Run GEvo Analysis" button. When adjusted, press the "Run GEvo Analysis" button to re-run the analysis.(quick link)
  9. Next let's compare the entire genome. First turn off the auto-adjust overlapping HSP and color matches in HSPs options. Return to the sequence submission tab by clicking on it. Set the amount of sequence analyzed to 300000 for all regions by entering that number in the box next to "Apply distance to all CoGe submissions" located near the bottom of the sequence submission box. Note: we are requesting more sequence than contained in these genomes both up-stream and down-stream from these genes. GEvo will only return sequence up till the beginning and end of a genome, so over-shooting the ends will not result in an error. Press "Run GEvo Analysis" to re-run the analysis. (quick link)
    1. Overall, you'll find that these genomes are nearly identical. However, there is one region in AcMNPV that is not present in PxMNPV and two regions in PxMNPV that are not present in AcMNPV. To analyze these regions in more detail, you can adjust the slider bars to zoom in on those regions with missing sequence.
  10. Adjust the slider bars and press the "Run" button to analyze the region in AcMNPV that is not present in PxMNPV (quick link)
    1. There are two possibilities for this difference: (1) a new insertion in AcMNPV or (2) a deletion in PxMNPV. Because there is a gene model in AcMNPV that spans this missing region in PxMNPV, it is more likely that PxMNPV experienced a deletion than that AcMNPV experienced an insertion that created a gene using pre-existing neighboring sequence. Also, insertions often leave behind a target site duplication due to staggered cuts in the DNA, and none is seen in AcMNPV.
  11. Zoom back out and analyze the whole genome by typing 300000 in the "Apply distance to all CoGe submissions" box and pressing the "Run" button (quick link)
  12. Now zoom in on the second region present in PxMNPV that is not present in AcMNPV. This is near position 50,000 in the PxMNPV genome. (quick link)
    1. The sequence present in PxMNPV that is not present in AcMNPV is likely due to an insertion, as evidenced by direct repeat sequences flanking the region.
  13. Now zoom in on the first region present in PxMNPV that is not present in AcMNPV. This is near position 26,000 in the PxMNPV genome. Question: What do you think has happened to give rise to this difference in genomic structure? Try to take into account the structure of the gene models as well. (hint: look at the alignments of the blast hits.) (quick link)
  14. Overall, this analysis provides evidence that since these viruses diverged, PxMNPV has had two new insertions and one deletion.
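
Step 9 above relies on GEvo returning sequence only up to the beginning and end of a genome, so that requesting 300000 nt on either side of a gene in a much smaller genome is not an error. The idea can be sketched as a simple clamping of coordinates. The function name, the 1-based inclusive convention, and the gene and genome sizes below are illustrative assumptions, not values taken from CoGe.

```python
def region_bounds(feature_start, feature_end, upstream, downstream, genome_length):
    """Return the (start, end) of a region extended around a feature.

    Coordinates are 1-based and inclusive. Requests that run past either
    end of the genome are clamped instead of raising an error.
    """
    start = max(1, feature_start - upstream)
    end = min(genome_length, feature_end + downstream)
    return start, end


# Hypothetical gene at 4500-5200 in a hypothetical 130,000 nt genome.
print(region_bounds(4500, 5200, 300000, 300000, 130000))
print(region_bounds(4500, 5200, 1000, 1000, 130000))
```

Over-requesting sequence in this way is a convenient shortcut for "compare the whole genome" when the exact genome length is not at hand.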

Example Analysis 3: Comparing genomic regions of several baculoviruses related to AcMNPV

In the previous example, you found the baculovirus most closely related to AcMNPV with a genome sequence in CoGe, and used GEvo to perform several high-resolution analyses of their genomes in order to identify and characterize three regions that were present in one genome and not the other. This exercise extends those analyses by adding additional baculoviruses for comparisons. These will be at varying evolutionary distances in order to get an overall feeling of baculovirus genome evolution, and to use outgroup comparisons to validate the conclusions made in the previous example.

Baculovirus phylogeny marked.png
  1. First select several baculoviruses for comparison at varying evolutionary distances from AcMNPV by using the previously generated phylogeny and the feature list of baculovirus polyhedrin genes (quick link of all polyhedrin genes). The ones picked for this example are marked in the phylogeny above. (quick link of selected polyhedrin genes).
  2. After genes are selected, select "GEvo" from the drop down menu next to "Send Checked Features to:" and press "Go"
  3. When GEvo loads, change the extent of all regions by typing "300000" into the box next to "Apply distance to all CoGe submissions" (quick link)
  4. WARNING: this analysis will be using 9 genomic regions, which is a lot of data. To make this more manageable, the following steps will help.
  5. Order the genomic regions so that AcMNPV is the first sequence, and the others are placed according to their evolutionary distance from AcMNPV. You can change the relative order of the sequence by dragging and dropping their sequence submission boxes relative to one another. (quick link of genomes ordered by evolutionary distance from AcMNPV)
  6. Since we are only interested in relationships to AcMNPV, press "Open all sequence option menus" and select "No" for "Reference Sequence" in all sequence submission boxes EXCEPT AcMNPV (quick link)
  7. To make the visualization of these regions more manageable, shrink the height of the genomic region panels by selecting the "Results Parameters" tab and changing the "Feature Height" to 10 pixels and the "Padding between tracks" to 1 pixel.
  8. Run the analysis by pressing "Run GEvo Analysis"
  9. GEvo-9-baculoviruses.png
    Looking at the results, you can see that there are a lot of regions similar to AcMNPV across all these genomes. However, some of the colored blocks representing these regions are drawn below the gene models in AcMNPV. These are blast hits that are in the opposite orientation. Some of these are due to inversions. To make interpreting these data somewhat easier, find those genomes that are mostly inverted with respect to AcMNPV and flip those genomes by selecting "Yes" for "Reverse complement" in the sequence options menu located in the appropriate sequence submission boxes. (quick link)
  10. GEvo-9-baculoviruses-orientation.png
    With the genomes oriented, it is now possible to see how well the overall structure of each genome compares to AcMNPV. You can explore this by clicking on the colored blocks and seeing which genome they match. For the most part, genome structure similarity decreases as evolutionary distance increases (as estimated by the polyhedrin gene). However, there are some exceptions:
    1. RoMNPV appears to have greater continuity to AcMNPV than does PxMNPV
    2. TnSNPV appears to be much more similar in overall structure to AcMNPV than does Xn granulovirus (XnGV).
  11. We can inspect these comparisons in more detail by removing non-relevant genomes from the comparisons.
  12. Let's compare AcMNPV, RoMNPV, and PxMNPV. To do this, select "Yes" for "Skip sequence" located in the sequence options menus of the sequence submission boxes for all other genomes. Next, turn both RoMNPV and PxMNPV into reference sequences so they will be compared to one another. (quick link)
  13. GEvo-Ac-Px-Ro-MNPV.png
    Clicking on the colored boxes for regions of sequence similarity will bring up a dialog box with information about those matches. If you do that for the same region in these genomes, you'll find that AcMNPV and PxMNPV are ~98-99% identical at the nucleotide level. Both AcMNPV and PxMNPV are about 95% identical at the nucleotide level to RoMNPV. Thus, even though AcMNPV and RoMNPV have a more similar overall genome structure than AcMNPV and PxMNPV do (fewer gene insertions), PxMNPV and AcMNPV are indeed more closely related.
  14. Also, this three-way comparison permits validation of the claims we made at the end of Example 2. Namely, the differences in genome structure between AcMNPV and PxMNPV were the result of two new insertions and one deletion in PxMNPV. This pattern holds true with the addition of RoMNPV. Since RoMNPV is more distantly related to AcMNPV and PxMNPV than they are to each other, the changes we see that are unique to PxMNPV must have happened in its lineage.
  15. Also, note that there is a likely insertion at the left side of the image that AcMNPV and PxMNPV share and that RoMNPV does not have.
    1. Question: How can you determine if that is likely an insertion in the AcMNPV-PxMNPV lineage, or a deletion in the RoMNPV lineage?
  16. Now let's take a look at the comparison with TnSNPV and XnGV. Obviously, polyhedroviruses should be more closely related to one another than they are to granuloviruses. Reconfigure and run an analysis to compare AcMNPV, TnSNPV, and XnGV. (quick link)
  17. GEvo-AcMNPV-TnSNPV-XnGV.png
    The percent sequence similarity for regions overlapping the polyhedrin gene is 84% for AcMNPV-TnSNPV and 62% for AcMNPV-XnGV. This is in agreement with what we assume about the evolutionary relationships of these viruses, but why did the phylogenetic tree place TnSNPV as more distantly related to AcMNPV than XnGV? If you look closely at the genes colored yellow (the ones we used in the phylogenetic analysis and to anchor positions in these genomes), you'll see what happened. The yellow gene in TnSNPV is not overlapped by the region with sequence similarity to AcMNPV! To see this in more detail, change the amount of genomic region analyzed to 2000nt for all these regions. (quick link)
  18. GEvo-AcMNPV-TnSNPV-XnGV-polyhedrin.png
    The close-up analysis of the region around our anchor gene reveals the error that happened. The anchor gene used for TnSNPV is not polyhedrin! CoGeBlast made a mistake in assigning a gene to the blast hit (which can happen, as these are algorithms). The mistake happened because there is a small amount of overlap between the polyhedrin gene and the one next to it in TnSNPV, and CoGeBlast picked the wrong one. The important lesson here is to double-check suspicious results!
    1. Question: Do the other genomes that clade with TnSNPV in the phylogenetic tree also suffer from this error?
    2. Question: Can you identify all the correct polyhedrin genes, create a new FeatList, and send them to the phylogenetics website for phylogenetic analysis and tree reconstruction?
  19. GEvo-AcMNPV-AgMNPV.png
    Let's now analyze some genomic inversions and translocations. AgMNPV has several of these in relation to AcMNPV. Do a comparison of these two genomes (quick link)
  20. As before, to gain an insight into the dynamics of these genomes (i.e. which one has had a particular change), comparison to an outgroup genome is necessary. Select any set of closely related genomes and compare them to these two to find one that helps place one of these inversion or translocation events in one lineage. After searching, I found that CfMNPV will be useful (quick link).
  21. Comparison of the complete genomes of AcMNPV, AgMNPV, and CfMNPV. Notice that AgMNPV and CfMNPV have more similar genome structure to one another than either does to AcMNPV. However, notice that AgMNPV has an additional inverted region when compared to either genome. This analysis can be regenerated from its saved link.
    First, compare the whole genomes of AcMNPV, AgMNPV, and CfMNPV; look closely at the inversion seen on the left-hand side of these genomes. Overall, AgMNPV and CfMNPV are more similar to one another than either is to AcMNPV (~73% versus ~65% sequence identity, respectively), and their overall structure is more similar as well. However, if you look closely at the left-hand side of their genomes (where there are several inverted regions with respect to the AcMNPV genome), you will notice that there is one region of CfMNPV that is not inverted with respect to AcMNPV's genome while the same region in AgMNPV's genome is. Find this region and zoom in on it for another GEvo analysis (quick link).
  22. GEvo analysis of three baculoviruses: AcMNPV, AgMNPV, CfMNPV showing an inversion specific to AgMNPV. This analysis can be regenerated from its saved link.
    By zooming in on this region, the inversion in AgMNPV becomes obvious. The entire region in both AcMNPV and CfMNPV is in the same orientation, as indicated by the continuous colored block drawn above the gene models. The colored blocks between either AcMNPV or CfMNPV and AgMNPV are broken, with one block drawn below the gene models. This indicates that this region is in the opposite orientation in AgMNPV. We can deduce that this inversion happened in the lineage of AgMNPV after its divergence from CfMNPV because AcMNPV's genome is like CfMNPV's AND AcMNPV is more distantly related to either of the other genomes than they are to each other. This concept of using an outgroup genome for comparison in order to determine the ancestral state between two more closely related genomes is very important. Whichever genome has the same state as the outgroup most likely retains the ancestral state, meaning that the change most likely happened in the other genome. The reason this is more likely rather than definite is that there is a chance that the outgroup genome and one of the related genomes both changed in the same way. However, adding more outgroup taxa to the analysis will help to strengthen the case.
  23. Conclusion: This example analysis represents some of the more complicated types of analyses that can be done with CoGe: comparing genomes at both the structural level and sequence similarity level using outgroups to determine the timing and placement of evolutionary events.
    1. Using a previously built phylogenetic tree of the evolutionary relationships of 50+ baculoviruses, we identified several baculoviruses at varying evolutionary distances from AcMNPV. We next used high-resolution sequence comparisons to examine some discrepancies between the pattern of genome structure conservation and the evolutionary relationships inferred from the phylogenetic tree. In one case, there were several insertions that appeared to break up the continuity of shared genome structure between AcMNPV and PxMNPV. By comparison to the close outgroup genome, RoMNPV, we could determine that these insertions were specific to the lineage of PxMNPV because AcMNPV and PxMNPV have a higher percent sequence identity to each other than either has to RoMNPV. These analyses showed how using outgroup genomes for comparison at both the genome structural level and sequence similarity level can determine where and when evolutionary events happened.
    2. Using a similar kind of analysis (genome structure and percent sequence identity), we checked into why the phylogenetic tree was placing TnSNPV as more distantly related to AcMNPV than XnGV (the latter is a granulovirus and the former two are both polyhedroviruses). This turned out to be an error in how CoGeBlast assigned a genomic feature to an overlying blast hit. A neighboring gene was erroneously being used instead of the correct polyhedrin gene. The major lesson here is that it is important to check the results, look for obvious inconsistencies, and investigate possible reasons for the error.
    3. Finally, we used the same kind of analysis to characterize an inversion detected between AcMNPV and AgMNPV. The outgroup CfMNPV had a region that was in the same orientation as AcMNPV. This means that it is most likely that an inversion happened at this location in the AgMNPV lineage because CfMNPV had the same state as AcMNPV.
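
The outgroup reasoning used throughout this example can be written down as a small decision rule. The sketch below is only an illustration of that parsimony argument. The function name and the state labels are assumptions, and, as noted above, the answer is "most likely" rather than certain because parallel changes are possible.

```python
def lineage_with_change(state_a, state_b, state_outgroup):
    """Infer which of two related genomes most likely changed, using an outgroup.

    Returns "a" or "b" for the genome whose state differs from the outgroup,
    "none" if the two related genomes agree, and "ambiguous" if the outgroup
    matches neither (more outgroups would be needed).
    """
    if state_a == state_b:
        return "none"
    if state_outgroup == state_a:
        # Genome a shares the outgroup (presumed ancestral) state, so b changed.
        return "b"
    if state_outgroup == state_b:
        # Genome b shares the outgroup state, so a changed.
        return "a"
    return "ambiguous"


# The inversion from this example: AgMNPV (a) and CfMNPV (b) are the closer pair,
# and AcMNPV serves as the outgroup. The orientation labels are illustrative.
result = lineage_with_change("inverted", "forward", "forward")
print({"a": "AgMNPV", "b": "CfMNPV"}.get(result, result))
```

The same rule applies to presence or absence of an inserted region: substitute labels such as "present" and "absent" for the orientation states.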