CoGe version 3

General Announcement

July 14th, 2009

After a 6-month round of development, I am happy to announce that a new version of CoGe is ready for use. This major release includes:

  • Modification of the CoGe database schema (and the resulting data migration issues)
  • Update of the database API
  • Update of all web programs to new API
  • Performance tuning of web applications
  • Update of web applications' user interface (predominantly using jQuery)
  • A grossly under-populated wiki!

Thanks to:

  • Josh Kane
  • Brent Pedersen
  • Dan Hembry
  • Damon Lisch
  • Michael Freeling


--Eric Lyons

Database Schema Change Details

If it weren't for CoGe's database, nothing else in the system would work. It is the core of CoGe and, with a relatively simple schema, allows multiple versions of multiple genomes, from multiple data sources and multiple organisms, to be stored in the same place. As the number of genomes in CoGe grew, the database scaled very well, easily accommodating over 6000 genomes and 80 gigabases of sequence. The overall size of the database was around 100 GB, with several million entries spread over its tables. While the system could handle this much data, our snapshot and backup system could not. CoGe uses MySQL to drive its database, and every time a table was changed (e.g. by adding a new genome), all the binary files of the database changed, and our backup system started copying another 100 GB of data. As new genomes are added to the system regularly, we quickly ran into problems with our backup server: our couple of terabytes of storage were filling up in days, and we needed backups spanning months in case something went badly wrong.

The solution was simple: move the raw genomic sequence data out of the database and into flat files. This shrank the database to around 20 GB and made our backup system much happier. Overall, this did not require a serious rewrite of the API, and it had the extra benefit of substantially speeding up the retrieval of genome sizes (essential for some comparative genomic work). Instead, I have Brent to thank for inspiring the major change in the API.
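To illustrate why flat files make both sequence retrieval and size lookups cheap, here is a minimal Python sketch. It is not CoGe's actual code or file layout: the one-file-per-chromosome layout, the one-byte-per-base encoding, and the function names `write_sequence`, `sequence_length`, and `fetch_subsequence` are all assumptions made for the example.

```python
import os
import tempfile


def write_sequence(path, sequence):
    """Store one chromosome as raw bases: no header, no line breaks (assumed layout)."""
    with open(path, "wb") as handle:
        handle.write(sequence.encode("ascii"))


def sequence_length(path):
    """With one byte per base, the sequence length is simply the file size."""
    return os.path.getsize(path)


def fetch_subsequence(path, start, stop):
    """Return bases start..stop (1-based, inclusive) by seeking to the offset."""
    with open(path, "rb") as handle:
        handle.seek(start - 1)
        return handle.read(stop - start + 1).decode("ascii")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "chr1.seq")
        write_sequence(path, "ACGTACGTAC")
        print(sequence_length(path))
        print(fetch_subsequence(path, 3, 6))
```

Under these assumptions, asking for a genome's size never touches the sequence itself, and adding a genome creates new files rather than modifying a large database file that the backup system would then have to copy again.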

Brent had the great idea of separating the genomic sequence data from the datasets that specify genomic features and annotations (e.g. genes, proteins, repeat sequences). This would allow us to take a genome, mask it any way we like, and immediately hook up the newly processed genomic sequence to its original annotations. If we (or others) created a custom set of annotations (e.g. conserved non-coding sequences, small peptides, missed annotations), we could easily add those annotations to a pre-existing genome without duplicating data. It also allows for the creation of a unified "genome", which is actually a lot more difficult from a data-management standpoint when dealing with genomes that may have been sequenced by anyone.
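The idea can be sketched with a few Python dataclasses. This is only an illustration of the relationship described above, not the real CoGe schema: the class names (`Feature`, `Dataset`, `GenomeSequence`), their fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    feature_type: str
    start: int
    stop: int


@dataclass
class Dataset:
    """A set of annotations from one source, stored independently of any sequence."""
    name: str
    features: list = field(default_factory=list)


@dataclass
class GenomeSequence:
    """One version of the sequence (e.g. unmasked or masked) with links to datasets."""
    organism: str
    sequence_type: str
    datasets: list = field(default_factory=list)


# Hypothetical example data.
genes = Dataset("published gene models", [Feature("gene1", "gene", 100, 900)])
cns = Dataset("custom CNS set", [Feature("cns1", "CNS", 950, 1000)])

# The masked sequence reuses the same gene dataset object instead of a copy,
# and a custom annotation set is attached without touching the original.
unmasked = GenomeSequence("example organism", "unmasked", [genes])
masked = GenomeSequence("example organism", "repeat-masked", [genes, cns])
```

Because both sequence versions point at the same dataset, a newly masked genome picks up its original annotations without any data being duplicated.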

While this new schema promised a substantial improvement in how we managed genomic data, implementing it in CoGe was difficult. Our group, and others, have published datasets and papers with links to CoGe for a particular set of sequences or analyses, and those links sometimes use CoGe's unique database IDs, so the IDs for major data points needed to be faithfully mapped. The data also comes from several places, some of which can be retrieved in an automated fashion while others require much more manual work, so reloading everything from the raw data files would have been painful. Finally, the schema change reflected a break in how CoGe relates two major sets of data: genomic sequence and features. To get this schema to work with the existing suite of tools, the API needed some substantial work, and every web-based and command-line tool in CoGe needed to be updated to reflect these changes.

Fortunately, I had just finished writing and submitting my PhD thesis, and was looking to get back to coding.

Performance Tuning

Nobody likes to wait for anything, especially when it comes to comparative genomics. To a large extent, CoGe is designed to allow researchers to quickly go from an idea or hypothesis, to testing it, to visualizing and interpreting the result, to getting a new idea, to testing that . . . so when you have to wait for the spinning double-helix to stop to get your results, you may lose momentum. Thus, while digging through the code to update the API, I took time to find areas of CoGe that were slow and tried to bump up the performance.

This fell into four general classes:

Server-side parallelization

CoGe currently runs on an 8-core server. Parallelization means finding the parts of a computation that can run at the same time and dividing them among the CPUs, or sending them to another server with spare processing power. While much of CoGe already ran in parallel, there were more places where the approach could be applied. The difficulty was finding the places where parallelizing would give a substantial performance increase for most analyses. GEvo, CoGe's tool for comparing multiple genomic regions in high detail, was high on the list for such optimization. I already had all the pairwise sequence comparisons running in parallel, but the images of the results were being generated in series. When I timed each stage of a GEvo analysis, I found that in almost all cases image generation was the slowest step, sometimes an order of magnitude slower than the rest of the analysis combined. With image generation parallelized, up to eight images are generated at the same time, which is more genomic regions than are usually analyzed at once.
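The change from serial to parallel image generation can be sketched as follows. This is a generic Python illustration using the standard library's `concurrent.futures`, not GEvo's code: `render_image` is a hypothetical stand-in that does placeholder CPU work and returns a label rather than drawing anything, and the region names are made up.

```python
from concurrent.futures import ProcessPoolExecutor


def render_image(region):
    """Stand-in for an expensive drawing step; returns a label instead of an image."""
    name, length = region
    checksum = sum(i * i for i in range(length)) % 97  # placeholder CPU work
    return f"{name}.png:{checksum}"


def render_all(regions, workers=8):
    """Render each region in its own worker process, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_image, regions))


if __name__ == "__main__":
    # Hypothetical regions: (name, amount of work).
    regions = [("regionA", 200_000), ("regionB", 150_000), ("regionC", 180_000)]
    print(render_all(regions))
```

With one worker per core, the total time for this step approaches the time of the slowest single image rather than the sum of all of them, as long as there are no more regions than workers.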

Client-side JavaScript rewrites

As the web interface to CoGe evolved, we used more and more JavaScript to make the system behave more like a desktop application. This degree of user control over the interface was needed when dealing with the large amounts of data CoGe can generate. CoGeBlast is a prime example: it lets a researcher BLAST as many sequences as they want against nearly any number of genomes in CoGe. The utility of CoGeBlast is the integration of BLAST results with the underlying genomic information. However, there is a lot of data available, not everyone wants to see all of it, and, more importantly, different questions may require seeing different types of data. The table CoGeBlast uses for listing BLAST hits is interactive, allowing researchers to sort on one or more columns, select which columns they wish to view, and select the genomic regions their sequences hit for further analysis. The first implementation of the column hiding and showing algorithm worked, but could be very slow on a large table of results or on an older computer. It has been updated to use more efficient logic and JavaScript routines, so the time spent waiting for columns to hide is greatly decreased.

Preservation of large analyses and data files

If you are going to compare two genomes consisting of hundreds of megabases of sequence, chances are you are going to want to look at the results again. As such, intermediate files of large analyses are now saved, and if an analysis is revisited and modified, only the steps affected by the changes are processed anew. SynMap is an example of this approach: it compares whole genomes using BLAST and identifies syntenic regions using DAGChainer (dagchainer.sourceforge.net), with a couple of other data processing steps. This may take hours to complete the first time for large genomes. If the analysis is repeated with different DAGChainer settings, or synonymous change calculations are wanted, the first stages of the analysis are already completed.
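One common way to reuse intermediate files is to name each step's output after the step and a hash of its parameters, and to recompute only when no matching file exists. The Python sketch below shows that pattern under stated assumptions; it is not SynMap's implementation, and `cache_path`, `run_step`, the JSON file format, and the parameter names are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def cache_path(cache_dir, step, params):
    """Name the output file after the step and a hash of its parameters."""
    key = json.dumps(params, sort_keys=True).encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()[:16]
    return os.path.join(cache_dir, f"{step}-{digest}.json")


def run_step(cache_dir, step, params, compute):
    """Return (result, reused); recompute only if no saved file matches params."""
    path = cache_path(cache_dir, step, params)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle), True
    result = compute(params)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    return result, False


if __name__ == "__main__":
    def compare_genomes(params):
        # Placeholder for the slow whole-genome comparison.
        return {"pair": [params["genome_a"], params["genome_b"]]}

    with tempfile.TemporaryDirectory() as cache_dir:
        genomes = {"genome_a": "A", "genome_b": "B"}
        print(run_step(cache_dir, "comparison", genomes, compare_genomes))
        # Same parameters again: the saved file is reused.
        print(run_step(cache_dir, "comparison", genomes, compare_genomes))
```

In this scheme, changing only the settings of a later step gives that step a new file name, while the earlier, expensive comparison keeps its name and is read back from disk.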

Algorithm improvements

Sometimes, the way an algorithm was implemented worked, but the performance scaled very poorly. I worked on several of these in CoGe, and there are probably many more to find. One example was the way CoGe's graphics library, GeLo, searched for features that overlapped at the same position.
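As a generic illustration of how an overlap search can scale poorly or well, the Python sketch below finds overlapping features with a sort-and-sweep instead of comparing every feature with every other one. It does not describe how GeLo works, before or after the update; the function name `overlapping_pairs` and the representation of a feature as a `(start, stop)` tuple are assumptions for the example.

```python
def overlapping_pairs(features):
    """Return sorted index pairs of features whose [start, stop] intervals overlap.

    Sorting by start lets each feature be compared only with features that
    are still "open" at its start, rather than with every other feature.
    """
    order = sorted(range(len(features)), key=lambda i: features[i][0])
    active = []  # indices of features that may still overlap later ones
    pairs = []
    for i in order:
        start, _stop = features[i]
        # Drop features that ended before this one starts.
        active = [j for j in active if features[j][1] >= start]
        pairs.extend((min(i, j), max(i, j)) for j in active)
        active.append(i)
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical feature coordinates.
    features = [(1, 10), (5, 15), (20, 30), (25, 26), (12, 14)]
    print(overlapping_pairs(features))
```

An all-pairs comparison does work proportional to the square of the number of features, whereas this sweep does a sort plus work proportional to the number of overlaps actually found, which matters when a region holds thousands of mostly non-overlapping features.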