CoGe version 3
July 14th, 2009
After a 6-month round of development, I am happy to announce that a new version of CoGe is ready for use. This major release includes:
- Modification of the CoGe database schema (and the resulting data migration issues)
- Update of the database API
- Update of all web programs to new API
- Performance tuning of web applications
- Update of web applications' user interface (predominately using jQuery)
- A grossly under-populated wiki!
- Josh Kane
- Brent Pedersen
- Dan Hembry
- Damon Lisch
- Michael Freeling
Database Schema Change Details
If it weren't for CoGe's database, nothing else in the system would work. The database is the core of CoGe and, with a relatively simple schema, allows multiple versions of multiple genomes, from multiple data sources, of multiple organisms to all be stored in the same place. As the number of genomes in CoGe grew, the database scaled very well, easily accommodating over 6000 genomes and 80 gigabases of sequence. The overall size of the database was around 100GB, with several million entries spread over its tables. While the system could handle this much data, our snapshot and backup system could not. CoGe uses MySQL to drive its database, and every time a table was changed (e.g. a new genome was added), all the binary files of the database changed, and our backup system started copying another 100GB of data. As new genomes are added to the system regularly, we quickly ran into problems with our backup server: our couple of terabytes of storage filled up in days, and we needed backups spanning months just in case something went really, really wrong. The solution was simple: move the raw genomic sequence data out of the database and into flat files. This shrank our database to around 20GB and made our backup system much happier. Overall, this did not result in a serious rewrite of the API, and it had the extra benefit of substantially speeding up the retrieval of genome sizes from the database (essential for some comparative genomic work). Instead, I have Brent to thank for inspiring the major change in the API.
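To make the flat-file move concrete, here is a minimal Python sketch of keeping raw sequence in a plain file and reading pieces of it back. It is not CoGe's actual code: the function names and the one-byte-per-base file layout are assumptions made for illustration. It shows one way such a layout can make length lookups and subsequence retrieval cheap.

```python
import os
import tempfile


def write_sequence(path, sequence):
    """Store raw sequence as one byte per base, with no header or newlines."""
    with open(path, "wb") as handle:
        handle.write(sequence.encode("ascii"))


def sequence_length(path):
    """Length comes from file metadata; the sequence itself is never read."""
    return os.path.getsize(path)


def fetch_subsequence(path, start, stop):
    """Return bases start..stop (1-based, inclusive) by seeking into the file."""
    with open(path, "rb") as handle:
        handle.seek(start - 1)
        return handle.read(stop - start + 1).decode("ascii")


if __name__ == "__main__":
    # Hypothetical demo file; a real layout would be one file per chromosome.
    demo_path = os.path.join(tempfile.mkdtemp(), "chr1.seq")
    write_sequence(demo_path, "ACGTACGTAA")
    print(sequence_length(demo_path))
    print(fetch_subsequence(demo_path, 3, 6))
```

With this kind of layout the database only needs to store the file location, so adding a genome touches a few small rows instead of gigabytes of table data.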
Brent had the great idea of separating the genomic sequence data from the datasets that specify genomic features and annotations (e.g. genes, proteins, repeat sequences). This would allow us to take a genome, mask it any way we like, and immediately be able to hook up the newly processed genomic sequence to its original annotations. In addition, if we (or others) created a custom set of annotations (e.g. conserved non-coding sequences, small peptides, missed annotations), we could easily add those annotations to a pre-existing genome without needing to duplicate data. This also allows for the creation of a unified "genome", which is actually a lot more difficult than it sounds from a data-management standpoint when the genomes may have been sequenced by anyone.
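The separation described above can be sketched as a small data model. This is an illustrative Python sketch only; the class and field names are hypothetical and do not mirror CoGe's real schema or API. The point it shows is that one set of features can be read against any version of the sequence, and that a custom dataset can be attached without copying sequence data.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    kind: str        # e.g. "gene" or "CNS"
    chromosome: str
    start: int       # 1-based, inclusive
    stop: int        # 1-based, inclusive


@dataclass
class Dataset:
    """A named collection of annotations, stored apart from any sequence."""
    name: str
    features: list = field(default_factory=list)


@dataclass
class GenomicSequence:
    """One processed version of the sequence, e.g. unmasked or repeat-masked."""
    label: str
    chromosomes: dict = field(default_factory=dict)  # name -> sequence string


@dataclass
class Genome:
    name: str
    sequences: dict = field(default_factory=dict)    # label -> GenomicSequence
    datasets: list = field(default_factory=list)

    def add_sequence(self, seq):
        self.sequences[seq.label] = seq

    def extract(self, feature, label):
        """Read a feature's bases from whichever sequence version is requested."""
        chrom = self.sequences[label].chromosomes[feature.chromosome]
        return chrom[feature.start - 1:feature.stop]
```

For example, adding a second `GenomicSequence` labelled "masked" makes every existing feature immediately readable against the masked sequence, and appending a new `Dataset` leaves the sequences untouched.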
While this new schema promised a substantial improvement in how we managed genomic data, implementing it in CoGe was difficult. Since our group, and others, have published datasets and papers with links to CoGe for a particular set of sequences or analyses that sometimes use CoGe's unique database ids, the unique database ids for major data points needed to be faithfully mapped. In addition, since this data comes from several places, some of which can be retrieved in an automated fashion while others require much more manual work, reloading everything from its raw data files would have been painful. Finally, this database schema change reflected a break in how CoGe relates two major sets of data: genomic sequence and features. In order to get this schema to work with the existing suite of tools, the API needed some substantial work, and every web-based and command-line tool in CoGe needed to be updated to reflect these changes.
Fortunately, I had just finished writing and submitting my PhD thesis, and was looking to get back to coding.
Nobody likes to wait for anything, especially when it comes to comparative genomics. To a large extent, CoGe is designed to allow researchers to quickly go from an idea or hypothesis, to testing it, to visualizing and interpreting the result, to getting a new idea, to testing that . . . so when you have to wait for the spinning double-helix to stop to get your results, you may lose momentum. Thus, while digging through the code to update the API, I took time to find areas of CoGe that were slow and tried to bump up the performance.
This fell into four general classes:
Currently CoGe runs on an 8-core server. Parallelization means identifying the parts of a computation that can run at the same time and dividing them among the multiple CPUs, or sending them out to another server with extra processing power. While much of CoGe already ran in parallel, there were more places where the approach could be applied. The problem was finding the places where parallelizing would give a substantial performance increase for most analyses. GEvo, CoGe's tool for comparing multiple genomic regions in high detail, was high on the list for such optimizations. I already had all the pairwise sequence comparisons running in parallel, but the images of the results were being generated in series. When I timed each stage of a GEvo analysis, I found that in almost all cases image generation was the slowest step, sometimes an order of magnitude slower than the rest of the analysis combined. Now that this step is parallelized, up to eight images can be generated at the same time, which is many more genomic regions than are usually analyzed at once.
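The idea of generating images in parallel rather than in series can be sketched in a few lines of Python. This is not GEvo's code: `render_image` is a hypothetical stand-in for the real drawing step, and the function and parameter names are assumptions. The sketch uses the standard `concurrent.futures` module with a pool capped at eight workers, to match the eight cores mentioned above.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def render_image(region):
    """Hypothetical stand-in for a slow, CPU-bound image rendering step."""
    name, length = region
    # Pretend the "image" is a row of characters, one per 100 bp.
    return name, "#" * (length // 100)


def render_all(regions, max_workers=8, executor_cls=ProcessPoolExecutor):
    """Render every region concurrently; results keep the input order.

    A process pool spreads CPU-bound work over several cores. Passing
    ThreadPoolExecutor instead suits renderers that mostly wait on an
    external program or on disk.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(render_image, regions))
```

Calling `render_all([("region_a", 1000), ("region_b", 2500)])` returns one result per region in the order given. With only a handful of regions per analysis, a pool of eight is rarely saturated, so each image effectively gets its own core.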
Preservation of large analyses and data files
If you are going to compare two genomes consisting of hundreds of megabases of sequence, chances are you are going to want to look at the results again. As such, intermediate files of large analyses are now saved, and if an analysis is revisited and modified, only the steps affected by the changes are processed anew. SynMap is an example of this approach: it compares whole genomes using blast and then identifies syntenic regions using [dagchainer.sourceforge.net DAGChainer] (with a couple of other data processing steps). This may take hours to complete the first time for large genomes. If the analysis is repeated with different DAGChainer settings, or synonymous change calculations are wanted, the first stages of the analysis are already completed.
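The reuse of intermediate files can be sketched as follows. This is a simplified Python illustration, not SynMap's implementation: the step names, the JSON cache files, and the placeholder "hits" are all assumptions, and neither blast nor DAGChainer is actually run. Each step is keyed by its own settings plus the key of the step before it, so changing the chaining settings leaves the expensive first stage untouched.

```python
import hashlib
import json
import os
import tempfile


def cache_key(step, params, upstream=""):
    """Key a step by its name, its settings, and the key of the previous step."""
    payload = json.dumps([step, params, upstream], sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def run_cached(cache_dir, step, params, compute, upstream=""):
    """Return (key, result, reused); compute() only runs on a cache miss."""
    key = cache_key(step, params, upstream)
    path = os.path.join(cache_dir, f"{step}-{key}.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return key, json.load(handle), True
    result = compute()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    return key, result, False


def pipeline(cache_dir, genomes, chain_params):
    """Two-stage toy pipeline: a slow comparison, then a cheap chaining step."""
    # Placeholder for the whole-genome comparison (hypothetical hit pairs).
    blast_key, hits, blast_reused = run_cached(
        cache_dir, "blast", {"genomes": genomes},
        lambda: [[1, 1], [2, 2], [9, 3]])
    # Placeholder for chaining: keep hits whose positions lie close together.
    _, chains, chain_reused = run_cached(
        cache_dir, "chain", chain_params,
        lambda: [h for h in hits if abs(h[0] - h[1]) <= chain_params["max_gap"]],
        upstream=blast_key)
    return chains, blast_reused, chain_reused


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(pipeline(workdir, ["genome_a", "genome_b"], {"max_gap": 1}))
    # Different chaining settings: the comparison stage is reused.
    print(pipeline(workdir, ["genome_a", "genome_b"], {"max_gap": 10}))
```

Including the upstream key in each step's key means a changed comparison automatically invalidates everything downstream of it, while a changed downstream setting invalidates nothing upstream.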
Sometimes, the way an algorithm was implemented worked, but the performance scaled very poorly. There were several of these in CoGe on which I worked, and probably many more to find. One example of this was the way CoGe's graphics library, [GeLo], worked when searching for features that overlapped at the same position.