CoGepedia:Current events
Version 3 of CoGe is released!
General Announcement
July 14th, 2009
After a 6-month round of development, I am happy to announce that a new version of CoGe is ready for use. This major release includes:
- Modification of the CoGe database schema (and the resulting data migration issues)
- Update of the database API
- Update of all web programs to new API
- Performance tuning of web applications
- Update of web applications' user interface (predominantly using jQuery)
- A grossly under-populated wiki!
Thanks to:
- Josh Kane
- Brent Pedersen
- Dan Hembry
- Damon Lisch
- Michael Freeling
--Eric Lyons
Database Schema Change Details
If it weren't for CoGe's database, nothing else in the system would work. The database is the core of CoGe and, with a relatively simple schema, allows multiple versions of multiple genomes, from multiple data sources and multiple organisms, to all be stored in the same place. As the number of genomes in CoGe grew, the database scaled very well, easily accommodating over 6,000 genomes and 80 gigabases of sequence. The overall size of the database was around 100 GB, with several million entries spread over its tables. While the system could handle this much data, our snapshot and backup system could not. CoGe uses MySQL to drive its database, and every time a table changed (e.g., when a new genome was added), all of the database's binary files changed and our backup system started copying another 100 GB of data. Since new genomes are added to the system regularly, we quickly ran into problems with our backup server: our couple of terabytes of storage were filling up in days, and we needed backups spanning months in case something went really, really wrong. The solution was simple: move the raw genomic sequence data out of the database and into flat files. This shrank our database to around 20 GB and made our backup system much happier. This change did not require a serious rewrite of the API, and it had the extra benefit of substantially speeding up retrieval of genome sizes from the database (essential for some comparative genomics work). Instead, I have Brent to thank for inspiring the major change in the API.
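To make the flat-file idea concrete, here is a rough sketch of how sequence might be served from disk. The storage directory, file layout, and function names below are illustrative assumptions, not CoGe's actual code; the point is simply that genome sizes come from a tiny index and subsequences are read by seeking into a flat file, so neither operation touches the database.

```python
# Illustrative sketch only: assumes one sequence file per genome plus a small
# JSON index of per-chromosome offsets and lengths. CoGe's real storage layout
# and API may differ.
import json
import os

STORAGE_DIR = "/opt/coge/sequence"  # hypothetical location of the flat files


def load_index(genome_id):
    """Load the per-genome index: {chromosome: {"offset": int, "length": int}}."""
    with open(os.path.join(STORAGE_DIR, f"{genome_id}.idx.json")) as fh:
        return json.load(fh)


def genome_size(genome_id):
    """Total genome size comes from the tiny index, not a scan of the sequence."""
    return sum(entry["length"] for entry in load_index(genome_id).values())


def fetch_subsequence(genome_id, chromosome, start, stop):
    """Return the 1-based, inclusive subsequence [start, stop] of a chromosome."""
    entry = load_index(genome_id)[chromosome]
    with open(os.path.join(STORAGE_DIR, f"{genome_id}.seq"), "rb") as fh:
        fh.seek(entry["offset"] + start - 1)   # jump straight to the region
        return fh.read(stop - start + 1).decode("ascii")
```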
Brent had the great idea of separating the genomic sequence data from the datasets that specify genomic features and annotations (e.g., genes, proteins, repeat sequences). This lets us take a genome, mask it any way we like, and immediately hook the newly processed genomic sequence up to its original annotations. Likewise, if we (or others) create a custom set of annotations (e.g., conserved non-coding sequences, small peptides, missed annotations), we can easily attach those annotations to a pre-existing genome without duplicating data. It also allows for the creation of a unified "genome", which is actually a lot more difficult to do from a data-management standpoint when the genomes involved may have been sequenced by anyone.
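As a rough illustration of the separation (the class and field names here are hypothetical, not CoGe's actual schema), the idea is that a genome is just a link between one genomic sequence and any number of annotation datasets, so a masked copy of the sequence can reuse the original datasets without duplicating them.

```python
# Hypothetical sketch of the separation: sequence and annotation datasets are
# independent records, and a genome links one sequence to many datasets.
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenomicSequence:
    sequence_id: int
    organism: str
    version: str
    masking: str            # e.g. "unmasked", "repeat-masked", "CDS-masked"
    flat_file_path: str     # where the raw sequence lives on disk


@dataclass
class Dataset:
    dataset_id: int
    name: str               # e.g. "gene models", "custom CNS calls"
    source: str


@dataclass
class Genome:
    genome_id: int
    sequence: GenomicSequence
    datasets: List[Dataset] = field(default_factory=list)


# The same annotation dataset can be attached to a masked derivative:
genes = Dataset(1, "gene models", "example source")
unmasked = Genome(1, GenomicSequence(1, "Arabidopsis thaliana", "9", "unmasked", "/seq/1.seq"), [genes])
masked = Genome(2, GenomicSequence(2, "Arabidopsis thaliana", "9", "repeat-masked", "/seq/2.seq"), [genes])
```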
While this new schema promised a substantial improvement in how we manage genomic data, implementing it in CoGe was difficult. Our group, and others, have published datasets and papers with links to CoGe for particular sets of sequences or analyses, and these links sometimes use CoGe's unique database ids, so the ids for major data points needed to be faithfully mapped to the new schema. In addition, since the data comes from several places, some of which can be retrieved in an automated fashion and others of which require much more manual work, reloading everything from the raw data files would have been painful. Finally, this schema change broke how CoGe relates its two major sets of data, genomic sequence and features, so getting the new schema to work with the existing suite of tools meant substantial work on the API, and every web-based and command-line tool in CoGe needed to be updated to reflect these changes.
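One way to picture the id-preservation concern (purely a hypothetical sketch, not the actual migration code): every old id is recorded against the id it receives in the new schema, and old published links are translated before lookup.

```python
# Hypothetical sketch of id preservation during a schema migration; the
# function and field names are illustrative, not CoGe's.

def migrate_features(old_rows, insert_new_feature):
    """Copy features into the new schema while recording old_id -> new_id."""
    id_map = {}
    for row in old_rows:
        new_id = insert_new_feature(row)   # id assigned by the new schema
        id_map[row["feature_id"]] = new_id
    return id_map


def resolve_feature(requested_id, id_map, fetch_new_feature):
    """Serve previously published ids by translating them before lookup."""
    return fetch_new_feature(id_map.get(requested_id, requested_id))
```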
Fortunately, I had just finished writing and submitting my PhD thesis, and was looking to get back to coding.
Performance Tuning
Nobody likes to wait for anything, especially when it comes to comparative genomics. To a large extent, CoGe is designed to allow researchers to quickly go from an idea or hypothesis, to testing it, to visualizing and interpreting the result, to getting a new idea, to testing that . . . so when you have to wait for the spinning double-helix to stop to get your results, you may lose momentum. Thus, while digging through the code to update the API, I took time to find areas of CoGe that were slow and tried to bump up the performance.
This work fell into four general classes (a small caching sketch for the third class follows the list):
- Server-side parallelization
- Client-side JavaScript rewrites
- Preservation of large analyses and data files
- Algorithm improvements
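For example, preserving large analyses can be as simple as keying results by a hash of the analysis parameters and reusing the stored file when the same request comes in again. The cache directory and helper below are assumptions for illustration, not CoGe's implementation.

```python
# Minimal sketch of result caching: hash the parameters, reuse a saved result
# if one exists, otherwise run the job and save it. Paths are hypothetical.
import hashlib
import json
import os

CACHE_DIR = "/opt/coge/cache"  # hypothetical location for saved results


def cached_analysis(params, run_analysis):
    """Return a previously saved result if the same parameters were run before."""
    key = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):                 # reuse the stored result
        with open(path) as fh:
            return json.load(fh)
    result = run_analysis(params)            # expensive comparative-genomics job
    with open(path, "w") as fh:              # save it for the next request
        json.dump(result, fh)
    return result
```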