Difference between revisions of "Using CoGe for the analysis of Plasmodium spp"

From CoGepedia

Revision as of 16:43, 8 November 2016

About this Guide

Welcome to the Plasmodium spp. genome analysis with CoGe guide. This 'cookbook' style document is meant to provide an introduction to many of our tools and services, and is structured around a case study of investigating genome evolution of the malaria-causing Plasmodium spp. The small size and unique features of this pathogen's genome make it a great example for beginning to understand how our tools can be used to conduct comparative genomic analyses and uncover meaningful discoveries.

Through a number of guided examples, this guide will teach users how to use the following tools:

  • LoadGenome
  • GenomeInfo
  • GenomeList
  • CoGeBLAST
  • GEvo
  • SynMap
  • CodeOn

A brief introduction to Plasmodium genome evolution

The distinctive features found in many parasitic genomes create particular challenges when using comparative genomics to study their evolution. Parasite genomes are characterized by a mixture of genome reduction associated with gene loss (e.g. homeobox genes) and the development of specialized genes. Many of the genes gained in parasitic genomes are involved in different aspects of host-parasite interaction and are, for the most part, species or lineage specific [1]. This dynamic nature of parasitic genomes is especially evident within the phylum Apicomplexa, and particularly within the genus Plasmodium. A marked loss of synteny between different Apicomplexa genera has been previously reported [2], although syntenic relationships between species within a single genus are largely conserved. While this finding remains true for many genera, the increasing number of sequenced Plasmodium genomes has shown that numerous clade- and species-specific gain/loss events and chromosome rearrangements have occurred [3]. The exact origins and mechanisms of these rearrangements remain largely unexplored, but they are generally hypothesized to stem from different host shift events [4][5], which have led to diverse types of host-parasite interactions.

Despite the enormous diversity of Plasmodium parasites, all studies to date (2016) show conservation of certain genomic characteristics. Fourteen chromosomes, a mitochondrial genome, and an apicoplast genome compose the entire repertoire of the Plasmodium genome in all sequenced species. This conservation in genomic complement is remarkable, especially considering that the potential for altering the number of chromosomes without compromising genome size can be observed ancestrally (e.g. 4 chromosomes and approximately 13Mb in Babesia bovis vs. 14 chromosomes and approximately 18Mb in the smallest Plasmodium genome). As in the case of other parasites, Plasmodium genomes are relatively small (approximately 17-28Mb) in comparison to those of their hosts, but larger than those of other Apicomplexan parasites (Theileria orientalis and Cryptosporidium parvum have genomes of approximately 9Mb) [6]. All Plasmodium species have a complex life cycle involving some kind of vertebrate host and a mosquito vector of the genus Anopheles. Though specificities and preferences during the infection process are prevalent within the genus [7], the overall preservation of the life cycle characteristics suggests the existence of a set of preserved core genes. These core genes are pivotal elements for the use of comparative genomics in the study of Plasmodium evolution.

An increase in funding devoted to malaria research during recent years has come hand in hand with increased understanding of Plasmodium genetics [8]. At the moment, there is an unprecedented number of Plasmodium genomes and gene sequences publicly available, spread across diverse databases. The most prominent repository is NCBI/GenBank [9], while additional and unique sequences can also be found in other databases: PlasmoDB, GeneDB and MalAvi [10][11][12]. The availability of genomic data from Plasmodium species opens the possibility to: identify the likely origin of certain traits, specialized phenotypes, and genomic landscapes; track the maintenance of conserved genes across the genus, as well as the rise and loss of genes unique to a single species or a group of closely related species; and infer the potential historical interactions which might have led to the development of adaptations as well as their putative consequences.

One of the many remarkable trends of Plasmodium genome evolution is the rapid change in GC content. P. falciparum and closely related parasites have a remarkably AT rich genome compared to other Plasmodium species [13]. While significant shifts in GC content have been reported in other parts of the tree of life such as Bacteria [14][15] and monocots [16], the short evolutionary time during which this change has occurred in Plasmodium is noteworthy. Moreover, the GC content variability observed amongst Plasmodium species has not yet been observed in other Apicomplexan genera. AT rich genomes not only present challenges for sequencing [17], but also result in entirely different trends of codon and amino acid usage. Furthermore, patterns of genome mutability and of the evolution of repetitive elements can also be markedly different in AT rich genomes. By utilizing various analysis tools for comparative genomics, it is possible to assess the evolutionary origins and trace patterns of GC content shift across the Plasmodium genus.

Another important aspect of Plasmodium evolution is the unique patterns of genome variability and the diverse responses to numerous selective pressures observed in different Plasmodium genomes. In this regard, comparative analyses performed between Plasmodium species and strains can elucidate the key elements behind these differences (e.g. different host pressures or an earlier species split), as well as identify genomic regions and elements where this type of change is more prominent. But perhaps more significant in Plasmodium evolution, and in that of parasites in general [18], might be the origin and evolution of multigene families. Within the Plasmodium genome, numerous multigene families show specific tracks of gene gain/loss events, and can be associated with variable syntenic changes. Moreover, the differences in the ancestry of these families are also noteworthy, with many of them being observed only in a single Plasmodium species or in closely related species, and others being observed across the entire genus but not in other Apicomplexa parasites [19]. In this sense, each multigene family can illustrate a different aspect of the evolutionary history of the genus.

In the following paper, we will demonstrate how to use the CoGe platform to analyze genomes and evaluate diverse evolutionary hypotheses. Through a case study on Plasmodium evolution, we will illustrate how CoGe can be used for the analysis of both genes (specifically multigene families) and whole genomes (genome composition, rearrangement events, conservation).

Finding and importing data into CoGe

The initial step in sequence analysis using CoGe is the import of new sequences to the platform.

The analysis of Plasmodium parasites using comparative genomics can be a challenging task due to the previously mentioned particularities of their genomes. Considering that an increasing number of Plasmodium genomes have become available in recent years, and that the genomic information for the genus is likely to increase in the near future, it is fundamental to search for new alternatives for the incorporation, analysis, and visualization of Plasmodium genomic data. In particular, tools which allow the rapid analysis of numerous sequences at various levels, and permit the identification of potentially relevant patterns on which novel analyses can be focused, are currently of high relevance for Plasmodium research. Additionally, the use of online platforms where complex genomic data can be incorporated and analyzed facilitates the start and continuation of collaborative initiatives. These platforms allow for the analysis of data regardless of differences in operating system, geographic location, or even access to high performance equipment. This aspect is of great significance for a genus like Plasmodium, which in the case of humans causes diseases associated with developing tropical countries where access to some equipment and software can be limited.


Finding out about the Plasmodium genomes already present in CoGe

While the amount of Plasmodium genomic data has risen during the past years, important advances in Plasmodium genomics have been occurring since the publication of the P. falciparum genome [20]. Thus, there is a considerable amount of historical data which can also be used for analysis and, depending on the hypotheses of interest, might be more relevant than later versions of the same data. As a result, there are a number of Plasmodium genomes in different development versions already imported into CoGe.

Before importing any genome into the CoGe database, and in order to prevent potential redundancy of genomic information, it is recommended to identify the Plasmodium genomic data already available. You can identify these genomes by:


A. Typing "plasmod" into the Search bar at the top of most pages. This will retrieve all organisms and genomes with names matching the search term.



B. For a more detailed description regarding the presentation and acquisition of the genomic information available in CoGe, follow these steps:

1. Go to: https://genomevolution.org/coge/

2. Create an account / log in to CoGe. See the How to get a CoGe account section on this wiki for more information.

3. On the main CoGe page, find the Tools tile and click on Organism View. This page can also be found by following this link: https://genomevolution.org/coge/OrganismView.pl

4. All publicly available genomes uploaded into CoGe and any corresponding information attached to them can be found in the Organism View section. You can find any published genome by typing a scientific name into the Search box. For each organism uploaded to CoGe you will find the following information:

Organisms: In the case of Plasmodium spp., the different parasitic strains currently uploaded. Any organelle genomes independently uploaded (mitochondrial and apicoplast) can also be found in this section.
Organism Information: provides an outline of the organisms’ taxonomy (following that published on NCBI/Genbank). This section also includes quick links to some of the main CoGe analysis tools and additional search engines.
Genomes: All the genome versions for the species of interest. Note that selecting a different genome version updates all other genomic information shown for that species on the page. This section allows you to access previous versions of a published genome (e.g. access scaffolds from a previous genome version that is currently at the chromosome assembly level).
Genome information: Shows the genome IDs, type of sequences uploaded and the length of these sequences. In this tab you will also be able to directly perform analyses using the CoGe platform.
Datasets: This section shows the number of datasets included for the specified genome. In the case of completely sequenced Plasmodium genomes obtained from NCBI/GenBank, it will indicate the accession numbers for each individual chromosome.
Dataset information: Provides specific information for each individually selected dataset including accession numbers (if available), source of the upload, chromosome length, and GC%.
Chromosomes: Shows the number of available chromosomes for the selected genome. However, depending on the method used to import the data into CoGe and the nature of the dataset itself, the count of chromosomes shown may be larger than expected (e.g. the number of contigs in lieu of the number of chromosomes).
Chromosome information: Shows the chromosome ID and the number of base pairs (bp) for that chromosome.

5. Clicking on Genome Info within the Genome Information section provides a more detailed description of the genome of interest and gives access to quick links to most comparative analysis tools available on CoGe.


Keep in mind that genomes imported into CoGe can have either a Public or a Restricted display. Genomes made public can be seen and analyzed by anyone using the CoGe platform. Restricted genomes, on the other hand, can only be seen and analyzed by the user and by those with whom the information has been shared: Sharing_data


Importing Plasmodium genomes into CoGe

While data can be uploaded into CoGe using a variety of methods, we will focus on two of the most likely to be used in the incorporation of Plasmodium genomes, due to their intuitive use for most databases. For additional information, please check the following link: How_to_load_genomes_into_CoGe

Importing genomes using the "Upload" method

Depending on your interests and hypotheses, it might be necessary to perform analyses using complete Plasmodium genomes or to focus only on specific parasitic organelles and chromosomes. If you wish to import a complete Plasmodium genome (including all organelles and chromosomes), make sure to follow these steps:
Screen capture of P. vivax genome's webpage on NCBI
1. Go to the genome database on NCBI/GenBank and type "Plasmodium" in the search box. You can select any genome of interest, but in this example we will use that of P. vivax (Salvador I strain).
2. Find the Representative Genome section in the upper section of your screen. Below you will find the Download Sequences in FASTA format and Download Genome Annotation sections.
- To download a complete P. vivax genome, click on Genome under Download Sequences in FASTA
- To download a complete annotation for the P. vivax genome, click on GFF under Download Genome Annotation
3. Both files will be downloaded to your chosen folder on your local computer.
Step 7: Screen capture of researcher's CoGe MyData tab
4. Go to CoGe or follow this link: https://genomevolution.org/coge/
5. Log in to CoGe.
6. Click on the MyData section on the upper left part of the screen. This will lead to the Data section of your personal CoGe page. This section will fill up as genomes of interest are uploaded into CoGe.
7. On the upper left section of the screen, click the NEW button and select New Genome from the dropdown menu.
Step 8: Screen capture of Create New Organism window at CoGe. Notice the different name of the selected strain and the one written under "Name"
8. Once the Create a New Genome window has appeared, information about the organism's taxonomy and the genome's origin must be entered. Keep in mind that depending on the type of organism being uploaded, taxonomic information might not have been incorporated into CoGe yet (e.g. a private species or strain). If this is the case, make sure to create a new organism by following these steps:
a. Click on NEW on the "Organism:" section
b. In the Search NCBI box, type the scientific name of the organism to be uploaded. If the organism of interest is not on NCBI yet, select its closest taxonomic relative. In the case of Plasmodium, several strains might be available for a given species (particularly P. vivax and P. falciparum). Make sure to select the correct strain or, if a new strain is being uploaded, to add the new strain's name.
c. Click Create
9. After successfully creating a new strain/genome, it is time to include any additional information that might be needed in the future. The number to type under Version depends on the number of versions of the selected genome already available in CoGe. Thus, it is important to check the latest genome version available in CoGe before importing a new version of the same genome (e.g. P. falciparum currently has 5 versions, so any new version incorporated should be number 6). Under the Type section, select the appropriate sequence type from the dropdown menu (most sequences can be identified as unmasked or masked). Select the Source in the next dropdown menu (in this case the source is NCBI, but other databases as well as Private sources are also available). Finally, tick the check box if you want your genome to be Restricted. Remember that:
- Restricted genomes can only be seen and analyzed by the user and those with whom the genome has been shared.
- Unrestricted genomes are available to anybody using CoGe.
10. Click Next
11. This new window allows you to import genome files using four different strategies: first, importing the data directly from the CyVerse Data Store (if the data is not already on the Data Store it can be easily exported there from CoGe afterwards); second, providing an HTTP/FTP link directly to the data; third, uploading the data from a private computer; and fourth, importing the data using GenBank accession numbers. We will continue this example using the Upload option.
12. Select the downloaded genome file and wait for it to be read by CoGe. Once the process is completed, select Next. Note that you should select your FASTA, FST or FAA file and not the GFF file (genome annotation).
13. Click Start on the next screen to begin the upload.
14. Once the file upload has concluded all information included by the user, as well as any specifics regarding the genome FASTA file itself, will be visible in the Genome Information page. Note that genomes in earlier stages of assembly (e.g. Scaffolds) can be easily uploaded into CoGe by this method.
Step 16: Complete genome and annotation upload into CoGe
15. At this point, genome annotation files can be also uploaded into CoGe for that same genome. These files can be included by clicking on the green Load Sequence Annotation button under the Sequence & Gene Annotation menu. Note that some limited analyses can be performed in CoGe even when genome annotation data is not yet available. Also, any specific upload can be updated at any point in time in CoGe. Thus, genome annotation data, metadata or experimental data can be included for a genome already imported into CoGe as soon as they become available.
16. The process for importing annotations is similar to that for importing genomes. On the Describe your annotation page, select the version and source of the annotation data and click Next. As previously described, the data can be imported directly from the CyVerse Data Store, by providing an HTTP/FTP link, or by using the Upload option. Note that both GFF and GTF files can be used for uploading genome annotation data from a private computer. Click Next and the annotation data associated with the genome will be imported into CoGe. This information should now be visible on the Genome Information page under the Sequence & Gene Annotation menu. For more details about uploading genome annotations follow this link: LoadAnnotation
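
Steps 12 and 16 above depend on picking the sequence file for the genome upload and the annotation file for the annotation upload. The following is a minimal local sketch, not part of CoGe, that guesses which kind of file you are holding before you upload it. The function name `sniff_format` and the rule it applies (a leading ">" for FASTA, a `##gff-version` pragma or nine tab-separated columns for GFF/GTF) are assumptions made for illustration.

```python
def sniff_format(text):
    """Guess whether text looks like FASTA sequence or GFF/GTF annotation.

    Hypothetical helper: it only inspects the first informative line.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            return "fasta"  # FASTA records start with a ">" header
        if line.startswith("##gff-version"):
            return "gff"  # GFF3 files usually open with this pragma
        if line.startswith("#"):
            continue  # other comment lines carry no format information
        if len(line.split("\t")) == 9:
            return "gff"  # GFF and GTF feature lines have nine columns
        return "unknown"
    return "unknown"


print(sniff_format(">demo_chr1 hypothetical record\nATGCATGC\n"))
print(sniff_format("##gff-version 3\n"))
```

In practice you would read the first few kilobytes of the downloaded file and pass them to the function, which is enough to avoid submitting the GFF file in the genome upload step.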


Step 1: Screen capture of NCBI chromosome section under the P. chabaudi genome tab on NCBI

Importing genomes using the "NCBI/GenBank" method

You can also upload specific chromosomes and organelles into CoGe. The following steps show how to import chromosomes one by one into CoGe from NCBI/GenBank:
1. Go to the genome database on NCBI/GenBank and type "Plasmodium" in the search box. You can select any genome of interest, but in this example we will use that of P. chabaudi (AS strain).
2. Find the Reference Genome section in the lower portion of your screen. Here you will find the RefSeq and INSDC numbers for each chromosome and, if available, organelles.
3. Follow steps 4 through 10 from the previous section.
Step 4: Screen capture of genome upload to CoGe using GenBank ID numbers
4. Select the GenBank accession numbers option. Type or copy/paste the INSDC number for each Plasmodium chromosome (or for specific Plasmodium organelles) and click the Get button after each one. Information from each imported sequence should appear under Selected file(s). Once all sequences have been imported (14 chromosomes in the case of Plasmodium), click on the Next button.
5. After the genome has been imported, all information included by the user, as well as any specifics regarding the genome FASTA file itself, will be visible in the Genome Information page. Note that uploading chromosomes/genomes using this method also imports any genome annotation information already included in NCBI/GenBank. Also, note that genomes uploaded using this method will be unrestricted, and thus visible to all CoGe users.
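
Because step 4 above involves pasting accession numbers one at a time, it is easy to skip or repeat a chromosome. The sketch below is a local sanity check, not a CoGe feature: it splits a pasted list, flags duplicates, and compares the count to the 14 chromosomes expected for Plasmodium. The function name `check_accessions` and the `DEMO..` identifiers are hypothetical placeholders, not real INSDC numbers.

```python
def check_accessions(pasted, expected_count=14):
    """Split pasted accession text, drop duplicates (keeping order), and list problems."""
    seen = []
    duplicates = []
    for token in pasted.replace(",", " ").split():
        if token in seen:
            duplicates.append(token)
        else:
            seen.append(token)
    problems = []
    if duplicates:
        problems.append("duplicated: " + ", ".join(duplicates))
    if len(seen) != expected_count:
        problems.append(f"expected {expected_count} accessions, found {len(seen)}")
    return seen, problems


# Hypothetical placeholder identifiers standing in for 14 chromosome accessions.
demo = " ".join(f"DEMO{i:02d}.1" for i in range(1, 15))
accessions, problems = check_accessions(demo)
print(len(accessions), problems)
```

If organelle genomes are imported alongside the chromosomes, pass a different `expected_count` (for example 16 for 14 chromosomes plus the mitochondrial and apicoplast genomes).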


Exporting genomes from CoGe to CyVerse

Data can be exported to CyVerse for easy sharing and storage after it has been imported into CoGe. While this is not needed to use CoGe or perform any analyses, it is a highly recommended step for complete and Certified genomes (those which represent the latest and most complete version of a given species' genome to date). You can use CoGe to export data to the CyVerse Data Store by following these steps:
1. While logged into CoGe, go to the Genome Information page on your genome of interest.
2. Under the Tools menu, find the Export to CyVerse Data Store option. Click either on the FASTA or the GFF file options to upload genomic data and its annotation, respectively. Make sure to specify a name for the GFF file before performing the export.
3. Wait until the export is completed. From this point forward, your FASTA and GFF files will also be found in the CyVerse Data Store. Note that no modification can be made to the uploaded genomes, so it is recommended to keep a list of the uploaded genome codes provided by CyVerse and their associated organism or strain.
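
One way to keep the recommended list of genome codes is a small CSV log maintained alongside your data. The following is a minimal sketch under stated assumptions: the column names, the helper names `add_entry` and `to_csv`, and the example code `demo-0001` are all hypothetical and do not reflect any CoGe or CyVerse format.

```python
import csv
import io

# Hypothetical column layout for a personal upload log.
FIELDS = ["genome_code", "organism", "strain", "version", "source"]


def add_entry(log, genome_code, organism, strain, version, source):
    """Append one record to the log, refusing duplicate genome codes."""
    if any(row["genome_code"] == genome_code for row in log):
        raise ValueError(f"genome code {genome_code} is already logged")
    log.append({
        "genome_code": genome_code,
        "organism": organism,
        "strain": strain,
        "version": version,
        "source": source,
    })


def to_csv(log):
    """Serialize the log as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(log)
    return buf.getvalue()


log = []
add_entry(log, "demo-0001", "Plasmodium vivax", "Salvador I", "1", "NCBI")
print(to_csv(log))
```

Writing the returned text to a file after every upload gives a record that can be shared with collaborators together with the exported FASTA and GFF files.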

Using CoGe tools to perform comparative analyses

Analyzing GC content and other genomic properties (GenomeList)

Step 5: Upload of 12 Plasmodium genomes to Genome List

Initial comparative genomic studies pointed to significant variations in GC content between Plasmodium species and even within a single genome. The average GC content of P. vivax and P. falciparum, two of the major causal agents of human malaria, is 42.3% and 19.4%, respectively. Changes in GC content have also been reported in different regions of P. vivax chromosomes, with subtelomeric regions being largely GC poor; in contrast, AT rich regions are widespread in the P. falciparum genome [21]. The evolutionary history of the GC change in Plasmodium has been a topic of interest since the sequencing of the first human malaria genomes. Taking into consideration the proposed order of speciation events within the genus, it has been proposed that the genome of the Plasmodium common ancestor might have been AT rich, a trait which has been maintained in P. falciparum. Therefore, the higher GC content reported in P. vivax and closely related species might actually be a derived trait [22]. Alternatively, the AT richness of the P. falciparum genome might be a trait of the common ancestor of the Laveranian subgenus alone. In order to confidently address these hypotheses, a larger number of Plasmodium species, preferentially those belonging to clades ancestral to the split of the Laveranian subgenus, should be evaluated. Regrettably, no fully sequenced genomes for such Plasmodium species are currently available; nonetheless, a more complete perspective of GC content evolution can be obtained thanks to the increasing number of sequenced Plasmodium genomes.

It is possible to calculate the GC content for each Plasmodium genome in CoGe by using the GenomeInfo tool found on Genome Information. GC content will be displayed for genomes imported from GenBank; however, genomes uploaded from private computers or in earlier stages of assembly will not have the GC content information on display. This information can be calculated on the Genome Information page itself and for each specific genome/chromosome, by clicking on %GC on the Length and/or Noncoding sequence sections under the Statistics tab.
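
For genomes where the value is not displayed, or simply to cross-check it, GC content can also be computed locally from the FASTA file. The sketch below is an illustration only and is not CoGe's implementation: it assumes %GC is taken over unambiguous bases (A, C, G, T), ignoring N and other ambiguity codes, which may differ from how CoGe counts them. The record names in the example are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header is the name
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_percent(seq):
    """Return %GC over A/C/G/T only; N and other codes are ignored (an assumption)."""
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


# Hypothetical two-record FASTA used only to exercise the functions.
example = """>chr_demo_1 hypothetical record
ATGCATGCNN
>chr_demo_2 hypothetical record
AAAATTTTGC
"""

for name, seq in read_fasta(example).items():
    print(name, round(gc_percent(seq), 1))
```

Running the same function per chromosome and on the concatenated sequence gives both the per-chromosome and whole-genome values, which is the comparison the Statistics tab offers.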

However, a better tool to compare GC content and other genomic features across several species/strains imported into CoGe is GenomeList. This tool creates a list of desired genomes and calculates various features for each one of them. A sample of the features which can be comparatively evaluated using GenomeList includes: amino acid usage, codon usage, CDS GC content, and genome features such as number of genes, introns, etc. In addition, this list also summarizes genome information included by the user: sequence type, sequence origin, taxonomy, provenance, version uploaded to CoGe, etc.
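
To make concrete what a codon usage comparison measures, the following sketch counts codons in a single coding sequence and converts the counts to fractions. It is a simplified illustration, not the calculation GenomeList performs: it assumes the sequence is already in frame from its first base, drops any trailing partial codon, and uses a hypothetical five-codon sequence.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in a coding sequence, reading in frame from the first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_fractions(cds):
    """Return each codon's share of all codons in the sequence."""
    counts = codon_usage(cds)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


demo_cds = "ATGAAAAAATTTTAA"  # hypothetical CDS: ATG AAA AAA TTT TAA
print(codon_usage(demo_cds))
print(codon_fractions(demo_cds))
```

Summing such counts over every CDS in a genome gives a genome-wide codon usage profile, which is the kind of quantity that shifts markedly between AT rich and GC rich Plasmodium genomes.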


The following steps indicate how to perform comparative analyses using the GenomeList tool in CoGe:

Step 7: Genome List used to compare 12 Plasmodium species. Note that some columns are not on display. Link to this analysis: https://genomevolution.org/r/lys1

1. Go to: https://genomevolution.org/coge/ and log in to CoGe

2. In the main CoGe page find the Tools tile and click on Organism View. You can also follow this link: https://genomevolution.org/coge/OrganismView.pl

3. Type the scientific name of the organism of interest on the Search box and select the desired version of the uploaded genome.

4. Find the Genome Information tile on the right side of the screen. Under Tools find and click on Add to GenomeList. This will automatically generate a new window indicating that the selected genome has been added to a list.

5. Without closing this window, type the scientific name of other organisms of interest on the Search. Once you have selected your second genome click on Add to GenomeList. The second selected organism should appear on the same window alongside the first. You can add as many genomes as desired by using this method.

6. Once you have included all your genomes of interest click on the green Send to Genome list button.

7. A new window should appear after a couple of seconds containing all your selected genomes in a table. Different features and information can be calculated and compared here, including information related to the uploaded genome. Most importantly, each genome has quick links that allow you to perform certain calculations (amino acid composition, %AT, etc.). Keep in mind that it is possible to perform specific analyses on certain genomes, but you can also perform the same analysis for all genomes on the GenomeList by clicking on the green Get All link below each column's title. Depending on the number and quality of the included genomes, this calculation might take a couple of minutes. Note that by clicking on the green Change Viewable Columns button on the upper right part of the screen it is possible to select which columns are displayed.

8. It is possible to download information from the selected genomes under a variety of formats using "Send Selected Genomes to". Note that the information downloaded will correspond to the genomes themselves and not to the calculations and analyses performed on GenomeList.


We have used GenomeList to compare 12 Plasmodium species with largely complete genomes. Results show that species closely related to P. falciparum share equally AT rich genomes. Moreover, GC content appears to gradually increase both in more recently diverged clades (rodent and simian) and within species from the simian clade. P. vivax, P. cynomolgi and P. knowlesi show the highest %GC out of all analyzed species, with P. vivax surpassing the other two by at least 6%. These results are in agreement with previous suggestions that GC content is currently undergoing a reversal in recently diverging Plasmodium species. It has been proposed that the increase of GC content in P. vivax, while maintaining GC poor subtelomeric regions, might be indicative of an efficient genome organization [23]. Interestingly, GC content was shown to be markedly low in P. malariae (another species causing human malaria) compared to other species of the simian clade, suggesting that this genome might have a GC content organization similar to that seen in species from the Laveranian subgenus. It should also be noted that none of the major human malaria parasites showed identical GC content, and thus they are likely to showcase different GC content organization. Therefore, while GC content does have an important role in the development and maintenance of variability in these genomes, particularly in regard to antigenic variation [24], it does not seem to be associated with the infection of specific host types. Moreover, these patterns indicate that the four major human malaria parasites might follow different evolutionary strategies more strongly related to their phylogenetic relationships than to their host.


Identifying gene homologs (CoGeBlast)

Screen capture of CoGeBlast input window. Genomes of interest and the query sequence are shown

Broadly speaking, genes belonging to the Plasmodium core genome and those unique to certain clades or species showcase different elements of the Plasmodium evolutionary panorama. A significant step in the study of a genome, or even a group of genes, is therefore the correct identification of homologous sequences. In this regard, the identification of multigene family members poses a particular challenge in the study of Plasmodium evolution. Multigene families are formed by two types of genes: orthologs (homologous genes related by speciation events) and paralogs (homologous genes related by duplication events). Within the genus Plasmodium, multigene families perform a wide variety of functions, showcase unique evolutionary patterns, and present diverse genomic arrangements. While many family members are arranged in tandem and can be easily associated with regions of microsynteny loss, others show far more complex patterns. A particular challenge is presented by subtelomeric families associated with antigenic variation and immune evasion (var, stevor and rifin in P. falciparum and closely related species, and pir in P. vivax and closely related species), which can have members distributed across the genome and also present rapid sequence variation, making it difficult to identify homologies that aid the establishment of ortholog/paralog relationships. [25][26][27][28]

In this sense, BLAST tools that identify multigene family members and permit the easy visualization of homologous regions between two or more genomes can have a large impact on the study of complex Plasmodium multigene families. We will evaluate how one of CoGe's tools (CoGeBlast) performs when identifying multigene family members and indicating their location across several Plasmodium genomes. In the following example, we will attempt to identify members of one of the most extended and challenging multigene superfamilies within Plasmodium: vir. [29][30]

Screen capture of CoGeBlast output. The relative position of hits to the query sequence is shown for the PO1 and Salvador-1 P. vivax strains

The vir superfamily is composed of 313 members [31], with paralogs either grouped into 10 different subfamilies or remaining independent according to sequence similarity analyses [32]. Previous studies have found that less than a third of vir genes are found in other P. vivax strains, demonstrating how rapidly this family is evolving. Nonetheless, this same study also found 15 vir genes shared across all P. vivax strains, which also presented low sequence polymorphism; in particular, one of these genes (PVX_113230) was found to share higher sequence similarity than any other family member and even maintain synteny in other Plasmodium species [33]. We have used this gene as the query sequence for the following example.


The following steps show how to use the CoGeBlast tool in the CoGe platform:

1. Go to: https://genomevolution.org/coge/ and log in to CoGe.

2. On the main CoGe page click on CoGeBlast under the Tools tile (Alternatively, you can follow this link: https://genomevolution.org/coge/CoGeBlast.pl).

3. Type the scientific name of your organism of interest in the Search box under Select Target Genomes. All organisms and genomes with names matching the search term will appear under the Matching Organisms menu. Also, any Notebooks matching the term will appear in a new window named Import List.

4. Select all the organisms of interest by using Ctrl+click or Command+click and then click on the green + Add button. The added organisms will appear in the Selected Genomes menu on the right. Alternatively, you can select any of the Notebooks found in Import List, and all genomes included in it will be automatically selected.

5. Paste the query sequence in FASTA format into the Query Sequence(s) section at the bottom of the screen. If desired, the BLAST analysis itself can be modified by changing the BLAST Parameters.

6. Once the analysis has been completed the output will include: a table showing the number of hits to the query sequence in the analyzed genomes, a graphic depiction of the location of these hits on the genome, and a list showing information for each hit including their similarity index.
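
Step 5 expects the query in FASTA format, that is, a header line starting with ">" followed by the sequence. The sketch below shows one way to build such a record in Python before pasting it into the Query Sequence(s) box. The helper name to_fasta, the 60-character line width, and the short sequence in the usage note are assumptions for illustration; the sequence is a placeholder and not the real PVX_113230 sequence.

```python
def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a raw sequence as a single FASTA record."""
    # Remove whitespace and line breaks that often come along when copying sequences.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    # Split the sequence into fixed-width lines.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

For example, `to_fasta("PVX_113230", "atg caa\nTTT", width=4)` returns a header line followed by the lines `ATGC`, `AATT` and `T`.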


In agreement with previous results, we found PVX_113230 to be highly conserved across P. vivax strains [34]. Interestingly, there was some small variation in the number of reported homologs across strains within the analyzed subfamily, with the Mauritania, PO1 and Salvador-1 strains showing the largest numbers of homologs. This shows that even within conserved subfamilies, the vir superfamily is highly diverse. Furthermore, a comparison of the location of sequence hits between the PO1 strain (not available for inclusion in the previous study) and the Salvador-1 strain shows highly conserved synteny: hits are located in approximately the same positions and, unless absent, on the same chromosomes. This pattern can also be observed, in less detail, in other strains with a scaffold-level assembly. These results could suggest that while PVX_113230 appears to be the founder of the vir superfamily and to perform an ancestral role, other regions or members of the family could also have functions outside the established role in immune evasion (results for this test can be replicated following this link: https://genomevolution.org/r/lyvj). As expected, when using another vir member outside this subfamily, the number of family members and their chromosome location vary largely across P. vivax strains.


Identifying microsyntenic regions (GEvo)

Different patterns of genome evolution can be observed among closely related species or even at an intraspecific level. Small genome rearrangements cause loss of synteny between a few genes (microsynteny), even if gene positions in large portions of a chromosome tend to be preserved. For instance, within Plasmodium, microsynteny can be lost in regions of high recombination frequency or where rapid genomic changes are evolutionarily advantageous. Thus, changes in microsynteny can point to regions of great evolutionary interest where a more detailed evaluation might be informative. In addition, careful assessment of genome properties in defined regions can aid in testing hypotheses about the evolutionary origin of genes. In this regard, it has been hypothesized that some of the genes essential for successful erythrocyte invasion found in Laverania (the reticulocyte-binding-like homologous protein 5 or Rh5, and the cysteine-rich protective antigen or CyRPA) might have originated via a horizontal gene transfer (HGT) event early in the evolution of the clade [35]. GEvo can be used to identify and visually represent patterns of genome evolution across multiple genomic regions and for any number of genomes, which can aid in testing this hypothesis.

GC content is shown as green bars in each genome background. The wobble codon GC content of each gene has also been colored


The following steps show how to use GEvo to analyze microsyntenic regions:

1. Go to: https://genomevolution.org/coge/ and log in to CoGe.

2. On the main CoGe page click on GEvo under the Tools tile (Alternatively, you can follow this link: https://genomevolution.org/coge/GEvo.pl).

3. Each box displayed under Sequence Submission allows you to select a sequence. You can specify as many as 25 sequences for a GEvo analysis. Each box is composed of several elements: a drop-down menu of sequence databases (CoGe database, NCBI GenBank or Direct Submission), the name of the selected sequence (e.g. gene ID numbers), the length of the genome segment to be displayed to the left and right of the sequence, and a green button used to specify additional Sequence Options (e.g. skip the sequence in the analysis, set the sequence as reference, set the sequence as reverse complement, or mask the sequence). You can input sequences for analysis by entering their corresponding IDs in the Name: bar. Alternatively, you can select pairs of genes for microsynteny analysis by zooming or clicking on specific regions of the SynMap display.

4. Once you have selected sequences for each display box (you can simply se), click on the red Run GEvo button at the bottom of the screen.

5. The GEvo analysis will output a display of the syntenic regions between the compared genomes. Genes are displayed in green at their corresponding genome position. Syntenic genome regions are marked by light red bars on top of each genome. These bars can be clicked to display connectors to the corresponding regions in all analyzed genomes.

6. You can modify the analysis by changing the parameters displayed on the Algorithm tab. Also, you can modify the information of the graphical display by altering the options on the Results Visualization Options tab. We have modified the display to show GC content (GC rich regions are shown as green in the background and AT rich regions are shown in white) and wobble GC content (indicated by a colored gradient: low GC content is displayed in red, ~50% GC content in yellow, and high GC content in green).
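
The wobble GC content mentioned in step 6 is the GC percentage at the third position of each codon. As a rough illustration of the quantity being coloured (a sketch, not GEvo's internal code), the function below computes it for an in-frame coding sequence. The function name and the requirement that the input length be a multiple of three are assumptions.

```python
def gc3_percent(cds: str) -> float:
    """GC percentage at third (wobble) codon positions of an in-frame CDS."""
    seq = cds.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")
    third_positions = seq[2::3]  # every third base, starting at codon position 3
    gc = third_positions.count("G") + third_positions.count("C")
    return 100.0 * gc / len(third_positions)
```

A gene whose value from this calculation differs strongly from that of its neighbours would appear in a contrasting colour in the GEvo display.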

The analysis shows a region where synteny is lost between the P. vivax Sal-1 strain and both the P. vivax PO1 strain and the P. cynomolgi genome.


We searched for Rh5 orthologs in all fully sequenced Plasmodium genomes from the Laveranian subgenus (P. falciparum strains 3D7 and IT, P. reichenowi strains CDC and SY57, and P. gaboni strain SY75) by using CoGeBlast. We then used the resulting output to perform a microsynteny analysis of these genome regions using GEvo. Our results show that microsynteny is largely maintained in the regions surrounding Rh5 and CyRPA; furthermore, there does not appear to be a marked difference in GC content inside and outside the region containing these genes for any of the evaluated genomes. It has been suggested that a GC content within a given genome region that does not correspond to the background GC content, or to the wobble GC content of surrounding genes, could be indicative of an HGT event. Such a pattern is not observed for either Rh5 or CyRPA, and therefore our results do not support the previously suggested HGT event [36]. It is possible that an HGT event occurring between genomes of similar composition might not be detected by this analysis, and thus additional testing might be required. However, differences in topology between specific gene trees and the species tree might have causes other than HGT. In particular, genes expressed during blood parasitic stages and involved in erythrocyte invasion are expected to be largely affected by selective pressures imposed by the host's immune system [37]. Therefore, they could present unique evolutionary patterns not related to HGT. The CoGeBlast analysis can be regenerated following this link: https://genomevolution.org/r/m1qw. The GEvo analysis can be run following this link: https://genomevolution.org/r/m4dq.

In addition, GEvo can be used to identify potentially poorly sequenced genome regions, which can influence the identification of larger syntenic patterns. A microsynteny analysis of the border regions where an inversion event has been detected between the P. vivax Salvador-1 strain and both the P. vivax PO1 strain and P. cynomolgi shows that loss of synteny correlates with the location of a poorly sequenced region in the P. vivax Salvador-1 strain (shown in orange). Synteny is maintained in the region between P. cynomolgi and the P. vivax PO1 strain, but lost when P. vivax Salvador-1 is compared with either the P. vivax PO1 strain or the sister species P. cynomolgi. This suggests that the inversion event observed in P. vivax Salvador-1 might be unique to this genome, or it might be an artifact due to poor sequencing of this region.


Performing syntenic analyses between two genomes (SynMap)

Identifying syntenic gene pairs

There are approximately 1787 protein family members thought to have originated after the split of the Plasmodium and Theileria common ancestors [38]. As expected, the number of orthologous genes increases the more closely related two species are [39]. The sequencing of new Plasmodium genomes provides the opportunity to identify syntenic gene pairs across different paired genome combinations. It also permits the identification of positional changes for certain genes and allows their potential evolutionary origin to be inferred. Furthermore, it is possible that changes in the order of genes in a genome can have an effect not only on the gene of interest but also on neighboring sequences. It has been shown in several eukaryotic organisms that gene expression and gene regulation might be largely dependent on genome location; furthermore, it has been proposed that gene co-expression clusters might be a significant element in eukaryotic gene regulation programs [40]. Moreover, previous studies have shown that gene expression and transcriptome evolution are affected by genome position [41][42]. While comparatively few studies have evaluated potential relationships between gene co-expression and genomic location in the genus Plasmodium, there is evidence that certain genes are strictly up-regulated during specific parasite life stages [43]. The identification of syntenic gene pairs in Plasmodium can provide the putative location of functionally advantageous clusters preserved by natural selection, as well as suggest sites of interest for evaluating the role that changes in gene order could have on gene expression.

One of the most significant tools found in the CoGe platform is SynMap. This tool is used to identify syntenic orthologous genes between two genomes and provides a graphical output for genes across the entire genome. In Plasmodium, such information can be used to identify highly conserved regions between two genomes, as well as to identify sections where synteny has been lost. These types of regions can later be analyzed in search of patterns suggesting neighboring effects on gene expression and transcription such as those described for other eukaryotes.


The following steps can be followed to perform comparative analyses using the SynMap tool on CoGe:

1. Go to: https://genomevolution.org/coge/ and log in to CoGe.

2. On the main CoGe page find the Tools section and click on Organism View (Alternatively, you can follow this link: https://genomevolution.org/coge/OrganismView.pl).

Step 5: SynMap input screen. Genomes for two different species are selected as an example: P. cynomolgi B strain (Organism 1), and P. vivax Salvador-1 strain (Organism 2)

3. Type the scientific name of the desired species on the Search box and select the appropriate genome. Then, click on the GenomeInfo link under the Genome Information section.

4. Find the link to the SynMap tool under the Analyze section on Tools.

5. By default, SynMap will allow you to evaluate the synteny of a genome with itself. This can be of use when characterizing a genome or when attempting to identify putative duplication events [44]. Alternatively, two different genomes or two different organisms can be analyzed by using SynMap. A different genome can be selected for Organism 1 or for Organism 2 by typing a different scientific name in either Search box before performing the SynMap analysis, and then selecting the desired genome. Once you have selected the organisms to analyze, you can run this tool by clicking on Generate SynMap.

6. Once the analysis has been completed, SynMap will output a graphical depiction of the syntenic regions between the two genomes. There are currently two versions of the SynMap tool: SynMap2, which is selected by default, allows you to interact with and dynamically alter the output (e.g. zoom in on a particular region showing a pattern of interest); and SynMap Legacy, which only provides static images. Either version can be used for the analysis of two genomes, and both require the same input.

7. Specific gene pairs of interest observed in SynMap can be analyzed in more detail in GEvo. The syntenic gene pair can be selected by zooming on the SynMap plot (this is done by clicking on the region of interest on SynMap Legacy or by dragging the mouse over the region on SynMap2). GEvo can then be run for specific gene pairs by double clicking on their syntenic point (SynMap Legacy), or by selecting this point and clicking on the Compare in GEvo >>> under Point Selection (SynMap2)


Identifying chromosomal inversions, fusions, fissions and other events between two genomes

Independent rearrangement events are observed in these pairwise comparisons with SynMap Legacy. From top to bottom and left to right: P. knowlesi vs. P. malariae; P. coatneyi vs. P. malariae; P. coatneyi vs. P. knowlesi; P. ovale vs. P. malariae; P. coatneyi vs. P. ovale; P. ovale vs. P. knowlesi

Another significant use of CoGe's SynMap tool is the identification of genome rearrangement events. Rearrangements originate when regions of the genome become duplicated or inverted, when a single region divides into two fragments, or when different regions fuse into a single fragment. Furthermore, SynMap can be used to identify indels between two genomes. Tracking these events can aid in pinpointing genome sections subject to more rapid change than others, as well as in identifying the evolutionary origin of certain genomic elements.
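
On a dotplot, an inversion appears as a run of syntenic gene pairs whose order is reversed in one genome relative to the other. The sketch below illustrates that idea in Python and is not the algorithm SynMap uses. It assumes that syntenic pairs for one chromosome pair are available as (position in genome A, position in genome B) tuples, and the function name and the three-gene minimum run length are hypothetical choices.

```python
from typing import List, Tuple


def inverted_runs(pairs: List[Tuple[int, int]], min_genes: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) positions on genome A of runs whose order is reversed in genome B."""
    ordered = sorted(pairs)  # walk along genome A
    runs = []
    start = 0  # index where the current descending run begins
    for i in range(1, len(ordered) + 1):
        # The run continues only while the genome B position keeps decreasing.
        descending = i < len(ordered) and ordered[i][1] < ordered[i - 1][1]
        if not descending:
            if i - start >= min_genes:
                runs.append((ordered[start][0], ordered[i - 1][0]))
            start = i
    return runs
```

A run that spans an entire chromosome may only reflect the orientation in which that chromosome was assembled, so candidate inversions found this way would still need to be inspected on the dotplot or in GEvo.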

Initial studies evaluating synteny conservation across species from the phylum Apicomplexa have shown that while synteny among genera is for the most part lost, gene order and position are highly maintained within the genus Plasmodium. Nonetheless, as a larger number of Plasmodium genomes has become available, it has become apparent that synteny patterns within the genus are far more complex. Closely related Plasmodium species have been shown to be largely syntenic with the exception of particular genomic regions; on the other hand, less closely related species from different Plasmodium clades are less likely to maintain synteny, with numerous rearrangement events being apparent [45]. Thanks to the larger number of Plasmodium genomes currently available, it is possible to evaluate Plasmodium synteny in a more complex array of species and within three of the four major Plasmodium clades. It is now possible to estimate species-specific genomic rearrangement events and assess their significance for genome evolution, as well as to identify the potential evolutionary origins of the most prominent rearrangements by performing several paired comparisons across different species sets.

In the case of P. vivax and closely related species, loss of synteny events have been reported on chromosomes 3 and 6 between P. vivax, P. cynomolgi and P. knowlesi. An analysis of these species using SynMap Legacy shows inversion events between P. vivax and both P. knowlesi and P. cynomolgi. Nonetheless, no inversion events are observed between P. cynomolgi and P. knowlesi. This suggests that the chromosomal inversions reported for chromosomes 3 and 6 might have occurred after the split of P. cynomolgi and P. vivax (approximately 3.43-3.87 Mya) and may be unique to the P. vivax genome [46]. Analyses can be regenerated following these links: https://genomevolution.org/r/lj12 (P. vivax vs. P. cynomolgi), https://genomevolution.org/r/lj1x (P. knowlesi vs. P. cynomolgi), and https://genomevolution.org/r/lj1t (P. knowlesi vs. P. vivax).

It is also possible to identify sets of chromosome fusion/fission events unique to specific genomes. Pairwise comparisons between the genomes of four closely related Plasmodium parasites (P. ovale curtisi, P. malariae, P. coatneyi and P. knowlesi) show that at least two sets of inversions and fusions have occurred in the P. coatneyi and P. malariae genomes. SynMap Legacy results show two fusion events in chromosomes 5 and 9 unique to P. malariae (marked with red squares) and two additional fusion events in chromosomes 13 and 14 of P. coatneyi (marked with green squares). Moreover, an inversion event can be observed in the central region of chromosome 4 in P. malariae (marked with a red circle). Analyses can be regenerated using the following links: P. knowlesi vs. P. malariae (https://genomevolution.org/r/lq5x); P. coatneyi vs. P. knowlesi (https://genomevolution.org/r/lj2b); P. coatneyi vs. P. malariae (https://genomevolution.org/r/lq5y); P. ovale vs. P. malariae (https://genomevolution.org/r/lq5t); P. coatneyi vs. P. ovale (https://genomevolution.org/r/lq65); and P. ovale vs. P. knowlesi (https://genomevolution.org/r/lq5v).


Measuring Kn/Ks values between genomes (SynMap - CodeML analysis tool)

The relative rates of synonymous (Ks) and non-synonymous (Kn) substitutions are a measure of the amount of change between two genomes. Synonymous substitutions are largely neutral or under low selective pressure; thus, Ks values can be used to measure mutation rates and to establish relative gene age. In contrast, Kn values are largely indicative of the effects of natural selection on any given gene. As a whole, the Kn/Ks ratio provides a picture of some of the evolutionary forces shaping gene evolution. Under neutrality, it is expected that Kn/Ks = 1, since synonymous and non-synonymous substitutions occur at the same rate. Positive selection is indicated by an excess of non-synonymous substitutions (Kn/Ks > 1), while purifying selection is indicated by an excess of synonymous substitutions (Kn/Ks < 1). The CoGe platform has the unique capability of calculating the Kn/Ks ratio on syntenic gene pairs; this means that it can provide a measure of the role of natural selection on gene evolution that takes into account the relative position of genes in the genome. Therefore, synteny-based Kn/Ks analyses help to define genome regions evolving under selective regimes different from those predominant across the entire genome, to identify the relative age of genome rearrangement events (e.g. duplications), and to establish genome-specific differences in genome evolution since the split from the common ancestor. All these elements are highly significant for the study of Plasmodium evolution given that different species have been shown to present distinct evolutionary patterns. For instance, several studies have pointed out that Plasmodium subtelomeric regions tend to show higher recombination rates and overall more rapid evolution than other regions of the genome, and in comparison with other Apicomplexa parasites [47].
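
The interpretation of the Kn/Ks ratio described above can be written down as a small decision rule. The Python sketch below is only an illustration of that rule and is not part of CoGe or CodeML. The function name and the optional tolerance around 1 are assumptions, and gene pairs with Ks of zero are rejected because the ratio is undefined for them.

```python
def selection_regime(kn: float, ks: float, tolerance: float = 0.0) -> str:
    """Classify a gene pair by its Kn/Ks ratio."""
    if ks <= 0:
        raise ValueError("Ks must be positive for the ratio to be defined")
    ratio = kn / ks
    if ratio > 1 + tolerance:
        return "positive selection"   # excess of non-synonymous substitutions
    if ratio < 1 - tolerance:
        return "purifying selection"  # excess of synonymous substitutions
    return "neutral"                  # both types accumulate at similar rates
```

For instance, a syntenic pair with Kn = 0.02 and Ks = 0.2 has a ratio of 0.1 and would be labelled as evolving under purifying selection.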

In the CoGe platform, Kn/Ks analyses can be performed for two annotated genomes after a SynMap analysis has been completed. The analysis is performed by using one of the available SynMap Tools and will modify the Syntenic_dotplot display to represent the distribution of Ks, Kn or Kn/Ks values.

Paired Ks analyses between Plasmodium species of the Laverania subgenus. From right to left: P. gaboni vs. P. reichenowi; P. falciparum vs. P. reichenowi; P. gaboni vs. P. falciparum


The following steps show how to perform Kn/Ks analyses using the CodeML tool available on SynMap:

1. Go to: https://genomevolution.org/coge/ and log in to CoGe.

2. Follow the steps to perform a SynMap analysis between the two genomes of interest. Keep in mind that CoGe stores all analyses performed under a user's account, so you can use a previously generated SynMap analysis. Also note that the Kn/Ks ratio can only be calculated for genomes with annotation included on CoGe (i.e. .gff files have been imported), regardless of their level of assembly.

3. Once you have the SynMap output for the two sequences, find the CodeML tool under the Analysis Options tab at the bottom of the screen. Click on the Calculate syntenic CDS pairs and color dots:________ substitution rates(s) section and select Synonymous (Ks) from the dropdown menu. You can also perform other analyses by selecting the: Non-synonymous (Kn) and (Ks/Kn) analysis options. The display can be modified by choosing a different Color Scheme from the second dropdown menu, or by specifying the axis default Min Val. or Max Val., and the Log10 Transform. of the data.

4. The resulting output will show the distribution of Ks values (or Kn or Ks/Kn) across the syntenic regions between the two evaluated genomes displayed on SynMap. In addition, the output will include a Histogram of Ks values (or Kn or Ks/Kn) below the updated SynMap. In SynMap2, specific regions/chromosomes can be dynamically selected in order to view the Ks, Kn or Ks/Kn values across a particular set of syntenic genes.
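
The histogram described in step 4 summarises log10-transformed substitution values. To make the transformation concrete, the sketch below bins a list of Ks values on a log10 scale using only the Python standard library. It is an illustration and not how SynMap draws its histogram; the function name, the bin width, and the decision to skip non-positive values (for which log10 is undefined) are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def log10_histogram(values: Iterable[float], bin_width: float = 0.5) -> Dict[float, int]:
    """Count values per log10 bin; keys are the lower edge of each bin."""
    counts: Counter = Counter()
    for value in values:
        if value <= 0:
            continue  # assumption: values without a defined log10 are left out
        lower_edge = math.floor(math.log10(value) / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))
```

Peaks at more negative bins correspond to gene pairs with fewer substitutions per site, which is how the histograms in the following figures are read.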


Paired Kn analyses between Plasmodium species of the Laverania subgenus. From right to left: P. gaboni vs. P. reichenowi; P. falciparum vs. P. reichenowi; P. gaboni vs. P. falciparum

Smaller Log10( ) substitution per site values of ___ are indicative of a lower number of synonymous (Ks) or non-synonymous (Kn) substitutions between the analyzed genomes. Since the effect of natural selection on synonymous substitutions is thought to be minimal, these types of substitutions are expected to accumulate in a largely constant manner. Paired Ks analyses performed between different genome sets provide information regarding their time of divergence and mutability. The Ks analysis between P. gaboni strain SY57 and P. reichenowi strain CDC shows a larger number of recent synonymous substitutions compared to the same analysis performed between P. gaboni and P. falciparum strain 3D7. This is an interesting result since P. reichenowi and P. falciparum are thought to have split recently (approximately 5.28-5.93 Mya [48]), while they share a distant common ancestor with P. gaboni [49]. The dissimilarity between the Ks rates of P. falciparum and P. reichenowi with respect to P. gaboni suggests that a change in synonymous substitution rates has occurred after the split of these sister taxa. If this change had occurred in the common ancestor of both species, synonymous substitution rates would be expected to be similar when each one is compared to P. gaboni, which is not the case. Furthermore, the Ks values between P. reichenowi and P. falciparum are slightly smaller than those observed between P. falciparum and P. gaboni, supporting the observation that Ks rates have increased in P. reichenowi after its split from P. falciparum, but that there was little variation in the substitution rate after the split of the common ancestor of both species from P. gaboni. This suggests that syntenic genes within P. reichenowi strain CDC are evolving at a more rapid rate than those of the other compared species within the Laveranian subgenus. These analyses can be replicated using the following links: P. reichenowi vs. P. falciparum (https://genomevolution.org/r/ljhj), P. reichenowi vs. P. gaboni (https://genomevolution.org/r/ljhq), and P. falciparum vs. P. gaboni (https://genomevolution.org/r/ljhl).

In contrast, the patterns of non-synonymous (Kn) substitution observed between P. gaboni and P. falciparum and between P. gaboni and P. reichenowi are largely similar, which suggests that a number of non-synonymous substitutions occurred after the split of the common ancestor of both species from P. gaboni. Moreover, the smaller rate but more recent number of non-synonymous substitutions observed between P. falciparum and P. reichenowi indicates a number of non-synonymous substitutions unique to each species. Overall, these results indicate that natural selection has had a role in shaping the divergence between these three genomes in a pattern likely associated with the colonization of different vertebrate hosts (e.g. humans vs. chimpanzees). Previous studies have shown that the non-synonymous substitution rates between P. reichenowi and P. falciparum are particularly large in a significant number of proteins, and that selective pressure and gene gain/loss events are largely predominant during erythrocyte invasion. These previous results suggest that stages associated with erythrocyte invasion have had a fundamental role in the expansion of the Laveranian subgenus [50], and that the colonization of humans by P. falciparum might have been facilitated, at least in part, via the transfer of genes encoding several key erythrocyte invasion proteins [51]. While our results agree with the significant role of natural selection in the evolution of the Laveranian subgenus, they also point to intrinsically different mutation patterns between P. reichenowi and P. falciparum. Analyses can be run following these links: P. reichenowi vs. P. falciparum (https://genomevolution.org/r/lsz2), P. reichenowi vs. P. gaboni (https://genomevolution.org/r/lsyy), and P. falciparum vs. P. gaboni (https://genomevolution.org/r/lsz5).


Identifying sets of syntenic genes amongst several genomes (SynFind)

Screen capture of GEvo analysis using the output from SynFind. Lines connect syntenic regions between members of the SERA multigene family

We have observed that a significant level of genome rearrangement is prevalent between Plasmodium clades and even between species within a single clade. A large number of events leading to loss of synteny are associated with species-specific gene gain/loss events; moreover, high recombination rates can result in gene duplicates apparently being located outside their point of origin, a pattern also consistent with horizontal gene transfer (HGT). In this regard, tools that allow the identification of syntenic regions across genomes, and in particular of those regions where genes of interest might be located, are of particular significance. Moreover, the identification of these regions, more than that of the gene of interest itself, can provide indispensable information regarding the gene's origin and trajectory. Within Plasmodium, the characterization of syntenic regions where multigene family members are found can aid in the identification of gain/loss events, rearrangements in the order of family members, or even evolutionary relationships among non-coding sequences, which can allow the inference of the evolutionary events leading to the spread, or reduction, of the family. These types of patterns are likely to be observed predominantly in multigene families with a tandem arrangement on the chromosome; a significant example within the genus Plasmodium is the SERA multigene family.

Though the specific details of their function are largely unknown, members of the SERA (serine repeat antigen) multigene family are found across all sequenced Plasmodium species. Overall, SERA multigene family members are characterized by encoding proteins with a papain-like cysteine protease motif [52], and are expressed during various stages of the Plasmodium life cycle. One member of this family (SERA-5), produced during late trophozoite and schizont stages, has been widely considered a promising target for malaria vaccine development and has reached phase Ib clinical trials (studies conducted in diagnosed patients) [53]. While members of the SERA family have been described in all sequenced Plasmodium genomes, the number of significant contractions, expansions and rearrangements observed across species points to a highly dynamic evolutionary history that can be explored with adequate tools. The SynFind tool in CoGe allows the identification of syntenic regions across any set of genomes after providing a specific query gene and reference genome.

Screen capture of GEvo analysis using SynFind output. Lines connect syntenic regions. Small syntenic fragments are found across intergenic regions


These steps show how to use SynFind to search for syntenic regions associated to particular sets of genes from a reference genome:

1. Go to: https://genomevolution.org/coge/ and log in to CoGe.

2. On the main CoGe page click on SynFind under the Tools tile (Alternatively, you can follow this link: https://genomevolution.org/CoGe/SynFind.pl).

3. Type the scientific name of your desired organism on the search bar. You will find this bar under the Search tab and on the Select Target Genomes section. Organisms and genomes with names matching the search term will be displayed on the Matching Organisms menu.

4. Select all the genomes of interest by using Ctrl+click or Command+click. After you have selected all genomes of interest, click on the green + Add button. Added genomes will appear on the Selected Genomes menu on the right.

5. Type the Name, Annotation or Organisms of interest in the Specify Features section. It is recommended to provide as many specifics for this query as possible; nonetheless, you should also be capable of performing the analysis even with less specific terms. For example, it is possible to retrieve the sequences of interest just by typing "sera" on the box corresponding to Name. Once you have specified your features, click on the green Search button.

6. All matches to the search term and genome where they have been found will appear as an output in a drop down menu within the same section. Select all relevant Matches (e.g. all SERA genes), and your reference Genome (e.g. P. falciparum strain 3D7 v5).

7. Once you have specified your feature, click the red Run SynFind button to start the analysis (you can regenerate this example using the following link: https://genomevolution.org/r/lszj).

8. SynFind will output all syntenic regions to the query sequence found on the reference genome and their Syntenic depth. Using this output, sequences can be further analyzed by using any of the numerous tools available on CoGe (generate SynMap dotplots for matches, perform a microsynteny analysis for these regions with GEvo, etc.).
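The idea behind the steps above can be illustrated with a small, self-contained sketch. It is not SynFind's actual algorithm or API: it simply takes the gene families flanking a query gene in a reference genome and looks, in each target genome, for the window of genes that shares the most families with that neighbourhood. The gene names, the `family` mapping, and the `flank` and `min_shared` parameters are all hypothetical and chosen only for illustration.

```python
def best_window(ref_fams, genes, family, span):
    """Return (shared_families, start_index) of the best-matching window."""
    best = (0, 0)
    for start in range(max(1, len(genes) - span + 1)):
        chunk = genes[start:start + span]
        shared = len(ref_fams & {family.get(g) for g in chunk})
        if shared > best[0]:
            best = (shared, start)
    return best


def synfind_sketch(ref_genes, query, targets, family, flank=2, min_shared=3):
    """Find, per target genome, a region sharing gene families with the
    neighbourhood of `query` in the reference gene order."""
    i = ref_genes.index(query)
    neighbourhood = ref_genes[max(0, i - flank): i + flank + 1]
    ref_fams = {family[g] for g in neighbourhood if g in family}
    span = 2 * flank + 1
    regions = {}
    for name, genes in targets.items():
        shared, start = best_window(ref_fams, genes, family, span)
        if shared >= min_shared:
            end = min(start + span, len(genes))
            regions[name] = (start, end, shared)
    return regions


# Hypothetical gene orders and family assignments (not real annotations).
ref_genes = ["r1", "r2", "sera5", "r4", "r5"]
targets = {
    "genomeX": ["x0", "x1", "x2", "x3", "x4", "x5"],
    "genomeY": ["y0", "y1", "y2"],
}
family = {
    "r1": "A", "r2": "B", "sera5": "S", "r4": "C", "r5": "D",
    "x0": "Z", "x1": "A", "x2": "B", "x3": "S", "x4": "C", "x5": "D",
    "y0": "Q", "y1": "S", "y2": "R",
}

regions = synfind_sketch(ref_genes, "sera5", targets, family)
depth = len(regions)  # number of target genomes with a matching region
print(regions, depth)
```

In this toy data, genomeY contains a gene from the same family as the query but none of its neighbours, so it is not reported. This mirrors why a syntenic region, rather than a single gene match, is the more informative unit.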


The information provided by SynFind makes it possible to rapidly identify regions where multigene family paralogs can be found. GEvo can then be used to evaluate the identified syntenic regions in detail. We used SynFind to identify potential syntenic regions to SERA-5 across six P. vivax strains from different geographic regions (the analysis can be recreated by following this link: https://genomevolution.org/r/lszj). Our results show that all evaluated P. vivax strains share the 12 reported SERA paralogs [54]; however, there is some intraspecific variation between the syntenic regions where SERA paralogs are found. Specifically, synteny is lost for certain family members in the P. vivax Brazil-1 strain (shown second from the top of the screen). The regions where synteny is lost are associated with the location of paralogs found only in P. vivax and closely related species. Therefore, it is possible that recently duplicated paralogs have not yet been fixed at the intraspecific level, or that there are evolutionary advantages associated with a variable number of paralogs within the same species, as has been previously discussed for other Plasmodium multigene families [55]. Nonetheless, it is important to note that while multigene family members are characterized by motifs common to the family, such motifs can occasionally be found in genes unrelated to the family that evolve under different patterns and mechanisms. Thus, motifs and domains identified by SynFind can be conserved across different types of genes or even intergenic regions, and should therefore be carefully evaluated.


Identifying codon and amino acid substitution frequencies (CodeOn)

Amino acid usage tables in Plasmodium species from the simian clade

The evolutionary significance of compositionally biased mutational pressure on codon and amino acid usage within the genus Plasmodium has been previously highlighted. The compositional bias observed in P. falciparum has been associated with variations in codon usage and gene expression; in particular, a preference for C-ending codons has been observed in many highly expressed genes despite this parasite's AT-rich genome. Moreover, expression patterns have also been associated with the usage of less energetically expensive amino acids, which could suggest that translational selection creates an evolutionary advantage by decreasing energetic costs during infection [56]. The significance of compositional bias and translational selection has also been shown to vary widely in other Plasmodium species; in particular, translational selection has been shown to have a small role in codon usage bias in P. vivax, although a larger one than in P. falciparum [57].

The role of compositional bias has been evaluated in only 6 Plasmodium species representing three of the four major Plasmodium clades. The large number of Plasmodium genomes now sequenced allows us to assess the role of compositional bias in closely related species that also share similar nucleotide composition. In order to assess differences in codon and amino acid usage potentially associated with GC content across Plasmodium species, we will use one of CoGe's analysis tools, CodeOn, which calculates amino acid usage across various levels of %GC for any given genome, together with the number of CDS in each computed %GC tier.
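To make the kind of table described above more concrete, the following is a minimal sketch of binning CDS by %GC and tallying amino acid usage per bin. It is an illustration only and does not reproduce CodeOn's implementation: the 5% tier width, the function names, and the two toy CDS are assumptions, and the standard genetic code is used for translation.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def gc_percent(seq):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def translate(cds):
    """Translate a CDS codon by codon; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def usage_by_gc_tier(cds_list, tier_width=5):
    """Return (amino acid frequencies per %GC tier, number of CDS per tier)."""
    counts = defaultdict(Counter)
    n_cds = Counter()
    for cds in cds_list:
        tier = int(gc_percent(cds) // tier_width) * tier_width
        n_cds[tier] += 1
        counts[tier].update(aa for aa in translate(cds) if aa not in "*X")
    usage = {}
    for tier, c in counts.items():
        total = sum(c.values())
        if total:
            usage[tier] = {aa: n / total for aa, n in c.items()}
    return usage, dict(n_cds)


# Two toy CDS (hypothetical): one AT-rich, one GC-rich.
usage, n_cds = usage_by_gc_tier(["ATGAAAAATTAA", "ATGGCCGGCTGA"])
for tier in sorted(usage):
    print(f"{tier}-{tier + 5}% GC: {n_cds[tier]} CDS", usage[tier])
```

The AT-rich toy sequence encodes lysine and asparagine (AAA, AAT), while the GC-rich one encodes alanine and glycine (GCC, GGC), which is the type of composition-driven shift in amino acid usage that the tables are meant to expose.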

Amino acid usage tables in Plasmodium species from the Laveranian subgenus


The following steps indicate how to build amino acid usage tables for any given genome:

1. Go to: https://genomevolution.org/coge/ and log in to CoGe.

2. Find your organism and genome of interest in Organism View (https://genomevolution.org/coge/OrganismView.pl).

3. Find the Genome Information section on the right side of the screen. Among the listed Tools you will find CodeOn. Click on it to start the analysis; the output will be shown in a different tab once it completes, after a couple of minutes.


As expected, similarities in %GC were more prevalent among closely related species than among species from different Plasmodium clades. Within the simian clade, P. vivax showed a large number of CDS with 45-55% GC, while most CDS in the other species fell in a slightly more skewed 40-45% GC range. In contrast, Plasmodium species of the Laveranian subgenus show a larger number of CDS with a reduced 20-30% GC. Nonetheless, CodeOn results show that the patterns of amino acid usage in relation to variations in GC content are still unique to each Plasmodium species. Interestingly, P. vivax and P. coatneyi were more similar to each other in their amino acid usage trends than to their sister taxa (P. cynomolgi and P. knowlesi, respectively). Moreover, these differences did not appear to be solely related to genome compositional bias, since in both cases GC content was more similar between sister taxa. These results suggest that amino acid usage is likely influenced by elements other than compositional bias in other Plasmodium species from the simian clade. Taking into account previously reported associations between codon usage and translational selection in P. vivax, it would be relevant to explore whether similar relationships are observed in other newly sequenced Plasmodium genomes.

In the case of Plasmodium species from the Laveranian subgenus, the sister species P. falciparum and P. reichenowi showed both similar amino acid usage and a similar number of CDS in low %GC tiers. On the other hand, the earlier-diverging species P. gaboni showed similar %GC patterns but dissimilar trends in amino acid composition. The likeness in the patterns observed among Laveranian species confirms that compositional bias is a significant factor in determining amino acid usage within the subgenus; however, as in species of the simian clade, additional elements also appear to play a role in determining amino acid composition. While difficult to assess using only three representative genomes, it is possible that these changes in amino acid usage originated at specific points during the diversification of the Laveranian subgenus; specifically, the skewed amino acid usage observed in P. reichenowi, and more markedly in P. falciparum, could represent a recently derived trait associated with the infection of a different host type, and might have arisen in their common ancestor after the split from P. gaboni and other Laveranian species.


Using Syntenic Path Assembly (SPA) to make analysis of poor or early genome assemblies easier (SynMap - SPA tool)

Syntenic Path Assembly (SPA) window analysis

While the Plasmodium genome panorama has become more complete in recent years, there is still a large number of incomplete Plasmodium genomes. These types of genome data originate from different sources: poorly sequenced or assembled genomes, sequencing projects which publish genomic information in its earlier stages of assembly, partially sequenced genomes, and unassembled private genomes. A challenge for the sequencing and assembly of Plasmodium genomes is the number of repetitive elements, low complexity sequences, and multigene families, which can vary widely between Plasmodium species and even among chromosome regions. Therefore, even with the use of reference genomes and the widespread usage of novel sequencing techniques, the assembly of Plasmodium genomes can be a complex task [58]. While unassembled genomes can be used in multiple types of studies (e.g. calculating the polymorphism in specific genes or genome regions), the information that they provide in more complex comparative genomics analyses can be limited.

Syntenic Path Assembly (SPA) of P. inui contigs using P. coatneyi genome as a reference

Accordingly, tools capable of identifying syntenic orthologs to a reference genome can be used to provide preliminary genome assemblies and allow the identification of genome elements of interest. One of CoGe's tools, Syntenic Path Assembly (SPA), provides a quick genome assembly based on any selected reference genome. This tool can be used with any incomplete assembly in order to provide information about the syntenic regions between two genomes as illustrated by SynMap. Alternatively, SPA can also be used to correctly orient syntenic regions which have been annotated using reverse DNA strands (this functionality is fundamental for the accurate identification of inversion events and for preventing data misinterpretation). We will use the SPA tool to assemble the P. inui genome (currently at the scaffold level) against the complete P. coatneyi genome.


The following steps show how to use the SPA tool found in SynMap:

1. Go to: https://genomevolution.org/coge/ and log in to CoGe.

2. Run a SynMap analysis between a completely sequenced genome and an incomplete genome assembly. You can review previous sections of this manuscript for instructions on how to run SynMap.

3. Once the SynMap has been generated, find the Display Options tab. Find the SPA tool at the bottom of the screen. Select the tool by clicking on the check mark next to: The Syntenic Path Assembly (SPA)?

4. After a few minutes (depending on the number of contigs), the incomplete genome will be assembled using the second genome as a reference.


Note that while SPA allows you to observe syntenic regions between the two genomes to a certain degree, there are some significant limitations regarding the interpretation of its assembly. For one, the incomplete genome will be assembled using a reference provided by the user. This means that contigs will be arranged on SynMap in a way that allows the largest level of synteny between the incomplete genome and the selected reference. Thus, the assembly of contigs will not be the same when different reference genomes are used. For instance, the P. inui genome can be assembled using P. coatneyi (a closely related species) or P. falciparum (a species from the Laveranian subgenus). In both cases, the synteny of the incomplete genome displayed on SynMap will be maximized, even though significant rearrangement events are evident when these two complete genomes are compared. Therefore, SPA reference genomes should be selected after consideration of the biological and evolutionary relationships between species.

Another point requiring care is the identification of rearrangement events such as inversions or duplications. Several contigs can potentially be syntenic to the same region and be incorrectly identified as a duplication event; on the other hand, contigs could have been annotated using the reverse DNA strand, showing a pattern which can be incorrectly identified as an inversion. Both potentially misinterpreted events are marked with black circles in the SPA assembly of P. inui using the P. coatneyi genome as a reference. The analysis can be replicated using the following link: https://genomevolution.org/r/ljen
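The dependence on the chosen reference can be seen in a simplified sketch of reference-guided contig ordering. This is not CoGe's SPA implementation: it assumes syntenic anchors are already available as hypothetical (contig, contig position, reference chromosome, reference position) tuples, assigns each contig to the reference chromosome holding most of its anchors, orders contigs by the median anchor position, and flags a contig as reversed when its anchors run in the opposite direction.

```python
from collections import Counter, defaultdict
from statistics import median


def order_contigs(hits):
    """Order and orient contigs along a reference from syntenic anchors.

    hits: iterable of (contig, contig_pos, ref_chr, ref_pos) tuples.
    Returns a list of (contig, ref_chr, strand) in reference order.
    """
    by_contig = defaultdict(list)
    for contig, cpos, rchr, rpos in hits:
        by_contig[contig].append((cpos, rchr, rpos))

    placed = []
    for contig, rows in by_contig.items():
        # Reference chromosome carrying most of this contig's anchors.
        best_chr = Counter(r[1] for r in rows).most_common(1)[0][0]
        pts = sorted((c, p) for c, ch, p in rows if ch == best_chr)
        # Anchors running backwards along the reference suggest that the
        # contig should be reverse-complemented before display.
        reversed_ = len(pts) > 1 and pts[-1][1] < pts[0][1]
        strand = "-" if reversed_ else "+"
        placed.append((best_chr, median(p for _, p in pts), contig, strand))

    placed.sort()
    return [(contig, chrom, strand) for chrom, _, contig, strand in placed]


# Hypothetical anchors for the same three contigs against two references.
hits_ref_a = [
    ("c1", 10, "chr1", 500), ("c1", 20, "chr1", 520),
    ("c2", 5, "chr1", 100), ("c2", 15, "chr1", 90),
    ("c3", 7, "chr2", 40),
]
hits_ref_b = [
    ("c1", 10, "chrA", 50), ("c1", 20, "chrA", 60),
    ("c2", 5, "chrA", 300), ("c2", 15, "chrA", 320),
    ("c3", 7, "chrA", 10),
]

order_a = order_contigs(hits_ref_a)
order_b = order_contigs(hits_ref_b)
print(order_a)
print(order_b)
```

The same contigs end up in a different order, and with a different orientation for c2, depending on the reference. An ordering produced this way is therefore a display convenience under the chosen reference, not evidence of the true chromosome structure.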


Overall conclusion

By comparatively analyzing genomes with different levels of relatedness within Plasmodium, it is possible to understand the origins and evolutionary forces shaping significant genome elements. The number of available Plasmodium genomes has increased markedly during recent years, providing an unprecedented opportunity to understand evolution in this genus. Furthermore, the unique qualities of the different Plasmodium genomes can be explored in detail.

Thanks to worldwide efforts, there was a large reduction in the number of malaria cases and deaths between 2000 and 2015. By 2015, it was estimated that the number of malaria cases had decreased from 262 million to 214 million, and the number of malaria-related deaths from 839,000 to 438,000 [59]. While this is an enormous achievement for malaria treatment and control strategies, human infections with P. cynomolgi [60] and P. knowlesi [61] have been reported in Southeast Asia. In addition, various Plasmodium species from the Laveranian subgenus, including P. falciparum strains, have been found in African primates [62][63], suggesting a potential role of wild primates as malaria reservoirs. Both examples illustrate the plasticity of the Plasmodium genome, where species barriers are more likely to be breached than we would desire. In this regard, Plasmodium-related studies should not only focus on those species of major human interest, but should also be partially devoted to gaining a better understanding of evolution in the genus. Thus, the use of platforms like CoGe, where genomes can be easily imported, analyzed, visualized and made public, represents an essential step in furthering comparative genomics in the genus Plasmodium.

We have used the different tools available in CoGe to successfully test various hypotheses significant for understanding Plasmodium evolution. In addition, we have used this platform to further characterize both general and specific genome elements in sequenced Plasmodium species and strains. In order to attain an even more complete picture of the complex evolutionary history of this genus, genomes from Plasmodium species ancestral to the Laveranian subgenus are required. Evolutionary questions such as the origins of the AT richness observed in the Laveranian subgenus, the potential changes in synteny between mammal and non-mammal infecting Plasmodium species, and the expansion/contraction/origin of multigene families can be evaluated more clearly once these genomes become publicly available and are incorporated into the CoGe platform. Overall, our results show that the complexities of the Plasmodium genome can be effectively analyzed in CoGe, and that doing so opens opportunities for furthering our understanding of malaria evolution and developing novel hypotheses.


Useful links

Plasmodium Notebooks in CoGe

Link to Notebook for published Plasmodium genome data: https://genomevolution.org/coge/NotebookView.pl?lid=1753
Link to Notebook for published P. falciparum strains: https://genomevolution.org/coge/NotebookView.pl?lid=1758
Link to Notebook for published P. vivax strains: https://genomevolution.org/coge/NotebookView.pl?lid=1760
Link to Notebook for published Plasmodium apicoplast data: https://genomevolution.org/coge/NotebookView.pl?lid=1754
Link to Notebook for published Plasmodium mitochondrion data: https://genomevolution.org/coge/NotebookView.pl?lid=1756

Sample data

Gene sequence used in the CoGeBlast analysis (obtained from PlasmoDB):
PVX_003830.1 | Plasmodium vivax Sal-1 | serine-repeat antigen 5 (SERA) (http://plasmodb.org/plasmo/app/record/gene/PVX_003830)
Gene sequences used in the CoGeBlast analysis that informed the GEvo analysis (obtained from PlasmoDB):
PF3D7_0424100.1 | Plasmodium falciparum 3D7 | reticulocyte binding protein homologue 5 (http://plasmodb.org/plasmo/app/record/gene/PF3D7_0424100)
PVX_096410.1 | Plasmodium vivax Sal-1 | cysteine repeat modular protein 2, putative (http://plasmodb.org/plasmo/app/record/gene/PVX_096410)


References

  1. Jackson AP. 2015. Preface. The evolution of parasite genomes and the origins of parasitism. Parasitology. 142 Suppl 1:S1-5. https://www.ncbi.nlm.nih.gov/pubmed/25656359
  2. Carlton JM, Perkins SL, Deitsch KW. 2013. Malaria Parasites. Caister Academic Press
  3. Tachibana SI, Sullivan SA, Kawai S, Nakamura S, Kim HR, Goto N, Arisue N, Palacpac NM, Honma H, Yagi M, Tougan T, Katakai Y, Kaneko O, Mita T, Kita K, Yasutomi Y, Sutton PL, Shakhbatyan R, Horii T, Yasunaga T, Barnwell JB, Escalante AA, Carlton JM, Tanabe K. 2012. Plasmodium cynomolgi genome sequences provide insight into Plasmodium vivax and the monkey malaria clade. Nat Genet. 44: 1051–1055. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3759362/
  4. Prugnolle F, Durand P, Ollomo B, Duval L, Ariey F, Arnathau C, Gonzalez JP, Leroy E, Renaud F. 2011. A Fresh Look at the Origin of Plasmodium falciparum, the Most Malignant Malaria Agent. PLoS Pathog. 7: e1001283. http://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1001283
  5. Prugnolle F, Rougeron V, Becquart P, Berry A, Makanga B, Rahola N, Arnathau C, Ngoubangoye B, Menard S, Willaume E, Ayala FJ, Fontenille D, Ollomo B, Durand P, Paupy C, Renaud F. 2013. Diversity, host switching and evolution of Plasmodium vivax infecting African great apes. Proc Natl Acad Sci U S A. 110:8123-8. https://www.ncbi.nlm.nih.gov/pubmed/23637341
  6. DeBarry JD, Kissinger JC. 2011. Jumbled Genomes: Missing Apicomplexan Synteny. Mol Biol Evol. 2011 Oct; 28(10): 2855–2871. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3176833/
  7. Sinka ME, Bangs MJ, Manguin S, Rubio-Palis Y, Chareonviriyaphap T, Coetzee M, Mbogo CM, Hemingway J, Patil AP, Temperley WH, Gething PW, Kabaria CW, Burkot TR, Harbach RE, Hay SI. 2012. A global map of dominant malaria vectors. Parasit Vectors. 5:69. https://www.ncbi.nlm.nih.gov/pubmed/22475528
  8. Buscaglia CA, Kissinger JC, Agüero F. 2015. Neglected Tropical Diseases in the Post-Genomic Era. Trends Genet. 31:539-55. https://www.ncbi.nlm.nih.gov/pubmed/26450337
  9. Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2016. GenBank. Nucleic Acids Res. 44: D67–D72. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702903/
  10. Aurrecoechea C, Brestelli J, Brunk BP, Dommer J, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb OS, Heiges M, Innamorato F, Iodice J, Kissinger JC, Kraemer E, Li W, Miller JA, Nayak V, Pennington C, Pinney DF, Roos DS, Ross C, Stoeckert CJ Jr, Treatman C, Wang H. 2009. PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res. 37:D539-43. https://www.ncbi.nlm.nih.gov/pubmed/18957442
  11. Logan-Klumpler FJ, De Silva N, Boehme U, Rogers MB, Velarde G, McQuillan JA, Carver T, Aslett M, Olsen C, Subramanian S, Phan I, Farris C, Mitra S, Ramasamy G, Wang H, Tivey A, Jackson A, Houston R, Parkhill J, Holden M, Harb OS, Brunk BP, Myler PJ, Roos D, Carrington M, Smith DF, Hertz-Fowler C, Berriman M. 2012. GeneDB--an annotation database for pathogens. Nucleic Acids Res. 40:D98-108. https://www.ncbi.nlm.nih.gov/pubmed/22116062
  12. Bensch S, Hellgren O, Pérez-Tris J. 2009. MalAvi: a public database of malaria parasites and related haemosporidians in avian hosts based on mitochondrial cytochrome b lineages. Mol Ecol Resour. 9:1353-8. https://www.ncbi.nlm.nih.gov/pubmed/21564906
  13. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B. 2002. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 419:498-511
  14. Wu H, Zhang Z, Hu S, Yu J. 2012. On the molecular mechanism of GC content variation among eubacterial genomes. Biol Direct. 7:2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3274465/
  15. Lassalle F, Périan S, Bataillon T, Nesme X, Duret L, Daubin V. 2015. GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands. PLoS Genet. 11: e1004941. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450053/
  16. Šmarda P, Bureš P, Horová L, Leitch IJ, Mucina L, Pacini E, Tichý L, Grulich V, Rotreklová O. 2014. Ecological and evolutionary significance of genomic GC content diversity in monocots. Proc Natl Acad Sci U S A. 111: E4096–E4102. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4191780/
  17. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B. 2002. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 419:498-511
  18. Jackson AP. 2015. Preface. The evolution of parasite genomes and the origins of parasitism. Parasitology. 142 Suppl 1:S1-5. https://www.ncbi.nlm.nih.gov/pubmed/25656359
  19. DeBarry JD, Kissinger JC. 2011. Jumbled Genomes: Missing Apicomplexan Synteny. Mol Biol Evol. 2011 Oct; 28(10): 2855–2871. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3176833/
  20. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B. 2002. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 419:498-511
  21. Carlton JM, Adams JH, Silva JC, Bidwell SL, Lorenzi H, Caler E, Crabtree J, Angiuoli SV, Merino EF, Amedeo P, Cheng Q, Coulson RM, Crabb BS, Del Portillo HA, Essien K, Feldblyum TV, Fernandez-Becerra C, Gilson PR, Gueye AH, Guo X, Kang'a S, Kooij TW, Korsinczky M, Meyer EV, Nene V, Paulsen I, White O, Ralph SA, Ren Q, Sargeant TJ, Salzberg SL, Stoeckert CJ, Sullivan SA, Yamamoto MM, Hoffman SL, Wortman JR, Gardner MJ, Galinski MR, Barnwell JW, Fraser-Liggett CM. 2008. Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature. 455:757-63. https://www.ncbi.nlm.nih.gov/pubmed/18843361
  22. Nikbakht H, Xia X, Hickey DA. 2014. The evolution of genomic GC content undergoes a rapid reversal within the genus Plasmodium. Genome. 9:507-511. https://www.ncbi.nlm.nih.gov/pubmed/25633864
  23. Das A, Sharma M, Gupta B, Dash AP. 2009. Plasmodium falciparum and Plasmodium vivax: so similar, yet very different. Parasitol Res. 105:1169-71. https://www.ncbi.nlm.nih.gov/pubmed/19543915
  24. Bull PC, Buckee CO, Kyes S, Kortok MM, Thathy V, Guyah B, Stoute JA, Newbold CI, Marsh K. 2008. Plasmodium falciparum antigenic variation. Mapping mosaic var gene sequences onto a network of shared, highly polymorphic sequence blocks. Mol Microbiol. 68:1519-34. https://www.ncbi.nlm.nih.gov/pubmed/18433451?dopt=Abstract
  25. Niang M, Yan Yam X, Preiser PR. 2009. The Plasmodium falciparum STEVOR multigene family mediates antigenic variation of the infected erythrocyte. PLoS Pathog. 5:e1000307. https://www.ncbi.nlm.nih.gov/pubmed/19229319
  26. Witmer K, Schmid CD, Brancucci NM, Luah YH, Preiser PR, Bozdech Z, Voss TS. 2012. Analysis of subtelomeric virulence gene families in Plasmodium falciparum by comparative transcriptional profiling. Mol Microbiol. 84:243-59. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491689/
  27. Petter M, Bonow I, Klinkert MQ. 2008. Diverse expression patterns of subgroups of the rif multigene family during Plasmodium falciparum gametocytogenesis. PLoS One. 3:e3779. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0003779
  28. Singh V, Gupta P, Pande V. 2014. Revisiting the multigene families: Plasmodium var and vir genes. J Vector Borne Dis. 51:75-81. https://www.ncbi.nlm.nih.gov/pubmed/24947212
  29. Fernandez-Becerra C, Yamamoto MM, Vêncio RZ, Lacerda M, Rosanas-Urgell A, del Portillo HA. 2009. Plasmodium vivax and the importance of the subtelomeric multigene vir superfamily. Trends Parasitol. 2009 25:44-51. https://www.ncbi.nlm.nih.gov/pubmed/19036639
  30. Lopez FJ, Bernabeu M, Fernandez-Becerra C, del Portillo HA. 2013. A new computational approach redefines the subtelomeric vir superfamily of Plasmodium vivax. BMC Genomics. 14:8. https://www.ncbi.nlm.nih.gov/pubmed/?term=A+new+computational+approach+redefines+the+subtelomeric+vir+superfamily+of+Plasmodium+vivax
  31. Carlton JM, Adams JH, Silva JC, Bidwell SL, Lorenzi H, Caler E, Crabtree J, Angiuoli SV, Merino EF, Amedeo P, Cheng Q, Coulson RM, Crabb BS, Del Portillo HA, Essien K, Feldblyum TV, Fernandez-Becerra C, Gilson PR, Gueye AH, Guo X, Kang'a S, Kooij TW, Korsinczky M, Meyer EV, Nene V, Paulsen I, White O, Ralph SA, Ren Q, Sargeant TJ, Salzberg SL, Stoeckert CJ, Sullivan SA, Yamamoto MM, Hoffman SL, Wortman JR, Gardner MJ, Galinski MR, Barnwell JW, Fraser-Liggett CM. 2008. Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature. 455:757-63. https://www.ncbi.nlm.nih.gov/pubmed/18843361
  32. Lopez FJ, Bernabeu M, Fernandez-Becerra C, del Portillo HA. 2013. A new computational approach redefines the subtelomeric vir superfamily of Plasmodium vivax. BMC Genomics. 14:8. https://www.ncbi.nlm.nih.gov/pubmed/?term=A+new+computational+approach+redefines+the+subtelomeric+vir+superfamily+of+Plasmodium+vivax
  33. Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, Chapman SB, Dash AP, Anvikar AR, Sutton PL, Birren BW, Escalante AA, Barnwell JW, Carlton JM. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046-50. https://www.ncbi.nlm.nih.gov/pubmed/22863733
  34. Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, Chapman SB, Dash AP, Anvikar AR, Sutton PL, Birren BW, Escalante AA, Barnwell JW, Carlton JM. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046-50. https://www.ncbi.nlm.nih.gov/pubmed/22863733
  35. Sundararaman SA, Plenderleith LJ, Liu W, Loy DE, Learn GH, Li Y, Shaw KS, Ayouba A, Peeters M, Speede S, Shaw GM, Bushman FD, Brisson D, Rayner JC, Sharp PM, Hahn BH. 2016. Genomes of cryptic chimpanzee Plasmodium species reveal key evolutionary events leading to human malaria. Nat Commun. 7:11078. https://www.ncbi.nlm.nih.gov/pubmed/27002652
  36. Sundararaman SA, Plenderleith LJ, Liu W, Loy DE, Learn GH, Li Y, Shaw KS, Ayouba A, Peeters M, Speede S, Shaw GM, Bushman FD, Brisson D, Rayner JC, Sharp PM, Hahn BH. 2016. Genomes of cryptic chimpanzee Plasmodium species reveal key evolutionary events leading to human malaria. Nat Commun. 7:11078. https://www.ncbi.nlm.nih.gov/pubmed/27002652
  37. Forni D, Pontremoli C, Cagliani R, Pozzoli U, Clerici M, Sironi M. 2015. Positive selection underlies the species-specific binding of Plasmodium falciparum RH5 to human basigin. Mol Ecol. 24:4711-22. https://www.ncbi.nlm.nih.gov/pubmed/26302433
  38. Wasmuth J, Daub J, Peregrín-Alvarez JM, Finney CA, Parkinson J. 2009. The origins of apicomplexan sequence innovation. Genome Res. 19:1202-13. https://www.ncbi.nlm.nih.gov/pubmed/19363216
  39. DeBarry JD, Kissinger JC. 2011. Jumbled Genomes: Missing Apicomplexan Synteny. Mol Biol Evol. 2011 Oct; 28(10): 2855–2871. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3176833/
  40. Michalak P. 2008. Coexpression, coregulation, and cofunctionality of neighboring genes in eukaryotic genomes. Genomics. 91:243–248. http://www.sciencedirect.com/science/article/pii/S0888754307002807
  41. Ghanbarian AT, Hurst LD. 2015. Neighboring Genes Show Correlated Evolution in Gene Expression. Mol Biol Evol. doi: 10.1093/molbev/msv053 http://mbe.oxfordjournals.org/content/early/2015/04/01/molbev.msv053.full
  42. De S, Teichmann SA, Babu MM. 2009. The impact of genomic neighborhood on the evolution of human and chimpanzee transcriptome. Genome Res. 19(5): 785–794. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2675967/
  43. Lanfrancotti A, Bertuccini L, Silvestrini F, Alano P. 2007. Plasmodium falciparum: mRNA co-expression and protein co-localisation of two gene products upregulated in early gametocytes. Exp Parasitol. 116:497-503. https://www.ncbi.nlm.nih.gov/pubmed/17367781
  44. Tang H, Lyons E. 2012. Unleashing the Genome of Brassica Rapa. Front Plant Sci. 3: 172. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3408644/
  45. Tachibana SI, Sullivan SA, Kawai S, Nakamura S, Kim HR, Goto N, Arisue N, Palacpac NM, Honma H, Yagi M, Tougan T, Katakai Y, Kaneko O, Mita T, Kita K, Yasutomi Y, Sutton PL, Shakhbatyan R, Horii T, Yasunaga T, Barnwell JB, Escalante AA, Carlton JM, Tanabe K. 2012. Plasmodium cynomolgi genome sequences provide insight into Plasmodium vivax and the monkey malaria clade. Nat Genet. 44: 1051–1055. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3759362/
  46. Pacheco MA, Reid MJ, Schillaci MA, Lowenberger CA, Galdikas BM, Jones-Engel L, Escalante AA. 2012. The origin of malarial parasites in orangutans. PLoS One. 7:e34990. https://www.ncbi.nlm.nih.gov/pubmed/22536346
  47. Lau AO. 2009. An overview of the Babesia, Plasmodium and Theileria genomes: A comparative perspective. Mol Biochem Parasitol. 164:1-8. http://www.sciencedirect.com/science/article/pii/S016668510800279X
  48. Pacheco MA, Reid MJ, Schillaci MA, Lowenberger CA, Galdikas BM, Jones-Engel L, Escalante AA. 2012. The origin of malarial parasites in orangutans. PLoS One. 7:e34990. https://www.ncbi.nlm.nih.gov/pubmed/22536346
  49. Rayner JC, Liu W, Peeters M, Sharp PM, Hahn BH. 2011. A plethora of Plasmodium species in wild apes: a source of human infection? Trends Parasitol. 27:222-9. https://www.ncbi.nlm.nih.gov/pubmed/21354860?dopt=Abstract&holding=npg
  50. Otto TD, Rayner JC, Böhme U, Pain A, Spottiswoode N, Sanders M, Quail M, Ollomo B, Renaud F, Thomas AW, Prugnolle F, Conway DJ, Newbold C, Berriman M. 2014. Genome sequencing of chimpanzee malaria parasites reveals possible pathways of adaptation to human hosts. Nat Commun. 5:4754. https://www.ncbi.nlm.nih.gov/pubmed/25203297
  51. Sundararaman SA, Plenderleith LJ, Liu W, Loy DE, Learn GH, Li Y, Shaw KS, Ayouba A, Peeters M, Speede S, Shaw GM, Bushman FD, Brisson D, Rayner JC, Sharp PM, Hahn BH. 2016. Genomes of cryptic chimpanzee Plasmodium species reveal key evolutionary events leading to human malaria. Nat Commun. 7:11078. https://www.ncbi.nlm.nih.gov/pubmed/27002652
  52. Arisue N, Kawai S, Hirai M, Palacpac NM, Jia M, Kaneko A, Tanabe K, Horii T. 2011. Clues to Evolution of the SERA Multigene Family in 18 Plasmodium Species. PLoS One. 6: e17775. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0017775
  53. Arisue N, Hirai M, Arai M, Matsuoka H, Horii T. 2007. Phylogeny and evolution of the SERA multigene family in the genus Plasmodium. J Mol Evol. 65:82-91. http://link.springer.com/article/10.1007%2Fs00239-006-0253-1
  54. Arisue N, Kawai S, Hirai M, Palacpac NM, Jia M, Kaneko A, Tanabe K, Horii T. 2011. Clues to evolution of the SERA multigene family in 18 Plasmodium species. PLoS One. 6:e17775. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0017775
  55. Rice BL, Acosta MM, Pacheco MA, Carlton JM, Barnwell JW, Escalante AA. 2014. The origin and diversification of the merozoite surface protein 3 (msp3) multi-gene family in Plasmodium vivax and related parasites. Mol Phylogenet Evol. 78:172-84. https://www.ncbi.nlm.nih.gov/pubmed/24862221
  56. Peixoto L, Fernández V, Musto H. 2004. The effect of expression levels on codon usage in Plasmodium falciparum. Parasitology. 128:245-51. https://www.ncbi.nlm.nih.gov/pubmed/15074874
  57. Yadav MK, Swati D. 2012. Comparative genome analysis of six malarial parasites using codon usage bias based tools. Bioinformation. 8:1230-9. https://www.ncbi.nlm.nih.gov/pubmed/23275725
  58. Chien JT, Pakala SB, Geraldo JA, Lapp SA, Humphrey JC, Barnwell JW, Kissinger JC, Galinski MR. 2016. High-quality genome assembly and annotation for Plasmodium coatneyi, generated using single-molecule real-time PacBio technology. Genome Announc. 4:e00883-16. https://www.ncbi.nlm.nih.gov/pubmed/27587810
  59. World Health Organization. 2015. World Malaria Report 2015. http://www.who.int/malaria/publications/world-malaria-report-2015/report/en/
  60. Ta TH, Hisam S, Lanza M, Jiram AI, Ismail N, Rubio JM. 2014. First case of a naturally acquired human infection with Plasmodium cynomolgi. Malar J. 13:68. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3937822/
  61. Singh B, Daneshvar C. 2013. Human infections and detection of Plasmodium knowlesi. Clin Microbiol Rev. 26:165-84. https://www.ncbi.nlm.nih.gov/pubmed/23554413
  62. Prugnolle F, Durand P, Neel C, Ollomo B, Ayala FJ, Arnathau C, Etienne L, Mpoudi-Ngole E, Nkoghe D, Leroy E, Delaporte E, Peeters M, Renaud F. 2010. African great apes are natural hosts of multiple related malaria species, including Plasmodium falciparum. Proc Natl Acad Sci U S A. 107:1458-63. https://www.ncbi.nlm.nih.gov/pubmed/20133889
  63. Duval L, Fourment M, Nerrienet E, Rousset D, Sadeuh SA, Goodman SM, Andriaholinirina NV, Randrianarivelojosia M, Paul RE, Robert V, Ayala FJ, Ariey F. 2010. African apes as reservoirs of Plasmodium falciparum and the origin and diversification of the Laverania subgenus. Proc Natl Acad Sci U S A. 107:10561-6. https://www.ncbi.nlm.nih.gov/pubmed/20498054