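There are two general programs to run: fasta_genome_loader.pl, which loads the genomic sequence, and gff_annotation_loader.pl, which loads the gene models and annotations.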
The University of Arizona Genetics Core (UAGC) produces many genomic sequences. These instructions are meant to help streamline their procedure for loading genomes into CoGe.
Note: Your fasta sequence headers will be used as the chromosome or contig (scaffold, superscaffold, etc.) names. Each header is parsed to keep only its first run of non-whitespace characters, i.e. everything after the first space in a header is ignored. Make sure that these "chromosome" names exactly match the ones used in the gene model/annotation file (e.g. GFF3). This is how gene models are matched to the chromosome they reside on.
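The name-matching rule above can be checked before loading anything. The following is a minimal sketch, not part of CoGe: the function names are hypothetical, and it assumes a tab-delimited GFF3 file whose first column is the sequence name. It applies the same "first run of non-whitespace characters" rule to the fasta headers and reports any GFF3 sequence names that have no matching fasta entry.

```python
def fasta_names(fasta_lines):
    """Return the sequence names as described above: the first
    whitespace-delimited token of each '>' header line."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def gff_seqids(gff_lines):
    """Return the set of values found in column 1 of a GFF3 file
    (assumption: tab-delimited, comments start with '#')."""
    seqids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids


def unmatched_seqids(fasta_lines, gff_lines):
    """GFF3 sequence names that do not appear among the fasta names."""
    return sorted(gff_seqids(gff_lines) - set(fasta_names(fasta_lines)))
```

For real files, pass open file handles, e.g. `unmatched_seqids(open("genome.fna"), open("models.gff"))` (file names here are placeholders). An empty list means every annotated sequence name has a matching fasta header.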
~/projects/CoGeX/scripts/load_genomes_n_stuff/fasta_genome_loader.pl \
  -org_name "Acidovorax sp. strain JS42 substrain KSJ2" \
  -org_desc "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Acidovorax;" \
  -source_name "University of Arizona Genetics Core" \
  -source_link "http://uagc.arl.arizona.edu/" \
  -ds_version .1 \
  -nt KSJ2_454AllContigs.fna \
  -dsg_restricted 1
mysql> select * from genomic_sequence_type;
+--------------------------+-----------------------------------+--------------------------------------------------------------------------------------------------------------+
| genomic_sequence_type_id | name                              | description                                                                                                  |
+--------------------------+-----------------------------------+--------------------------------------------------------------------------------------------------------------+
|                        1 | unmasked                          | unmasked sequence data                                                                                       |
|                        2 | masked repeats 50x                | repeats with more than 50x occurrence have been masked                                                       |
|                        3 | 50X mask +syntenic thread with Os | double masked: 50x repeats and non-coding sequences. CNS sequences with Os retained                          |
|                        4 | masked repeats 40x                | repeats with more than 40x occurrence have been masked                                                       |
|                        5 | super masked repeats 50x          | repeats with more than 50x occurrence have been masked. Additional processing was needed for these sequences |
|                        6 | te+kmer masked                    | transposons and kmer hard masked (Bao method)                                                                |
|                        7 | masked by JGI                     | downloaded masked                                                                                            |
|                        8 | masked by genoscope               | NULL                                                                                                         |
|                        9 | RepeatMasker                      | with MIPS repeat data                                                                                        |
|                       10 | masked by Cacao Genome Database   | NULL                                                                                                         |
|                       11 | Repeat masked by Andrea Zuccolo   | NULL                                                                                                         |
|                       12 | masked by GMGC-nt                 | NULL                                                                                                         |
|                       13 | masked by GMGC                    | NULL                                                                                                         |
+--------------------------+-----------------------------------+--------------------------------------------------------------------------------------------------------------+
.
.
.
Creating feature of type chromosome
Creating feature_name chromosome contig00315 for feat 81329038
Adding location contig00315:(1-70, 1)
Loading genomic sequence for chromosome: contig00316 (9 nt)
Working on chromosome contig00316 of type chromosome
Creating feature of type chromosome
Creating feature_name chromosome contig00316 for feat 81329039
Adding location contig00316:(1-9, 1)
Formatdb running /usr/bin/formatdb -p F -o T -i /opt/apache/CoGe/data/genomic_sequence/0/0/11/11229/11229.faa
dataset_id: 46764
dataset_group_id: 11229
When fasta_genome_loader.pl is run, you'll see a stream of text reporting that genomic features for chromosomes are being created in CoGe, and that names, locations, etc. are being specified for those features. Note that each chromosome is itself tracked as a genomic feature within CoGe, so one set of these messages appears for each chromosome (contig, etc.) loaded from your fasta file. When all sequences are loaded, the program creates the blastable databases for the sequence, and the final lines give the dataset_id and dataset_group_id of the newly loaded genome. Save these numbers in your notes. You may need them in the future (though you can always look them up in OrganismView).
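If you save the loader's output to a file, the two IDs can be pulled out automatically rather than copied by hand. This is a minimal sketch, not part of CoGe: it assumes the final lines look exactly like the `dataset_id: 46764` and `dataset_group_id: 11229` lines in the sample output above, and `extract_ids` is a hypothetical helper name.

```python
import re

# Matches lines such as "dataset_id: 46764" or "dataset_group_id: 11229".
# Assumption: the loader prints the IDs in this exact "name: number" form.
ID_LINE = re.compile(r"^(dataset_id|dataset_group_id):\s*(\d+)$")


def extract_ids(output_text):
    """Return a dict of the IDs found in saved loader output."""
    ids = {}
    for line in output_text.splitlines():
        match = ID_LINE.match(line.strip())
        if match:
            ids[match.group(1)] = int(match.group(2))
    return ids
```

For example, if the output was saved to a log file (name of your choosing), `extract_ids(open("load.log").read())` returns both IDs in a dict that you can record in your notes.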
Often, you'll want to take a contig-level assembly and generate a complete genome by comparison to a reference genome. CoGe supports this and can also print out an assembled version of your genome. The assembled genome may then be reloaded (as a higher version) using the method described above. See the CoGe documentation on syntenic path assembly for details.
Prodigal is a great program for quickly and easily identifying gene models in bacterial genomes. To annotate a genome using the program (e.g. one on which you just performed a syntenic path assembly), you may run:
prodigal -f gff < KSJ2_no-split-syntenic_path_assembly.fna > prodigal_models.gff
Warning: Gene model and annotation formats are quite heterogeneous, even within a single format (e.g. GFF3). As a result, no single program in CoGe works for all GFF3 files. Instead, there is a general program that is customized for each data source's particular implementation of GFF3, and many such versions exist. The CoGe program for loading GFF3 data (gff_annotation_loader.pl) has also become a bit like frankencode. Please feel free to contact Eric Lyons for help, customizations, etc. Fortunately, once an institute has settled on a consistent format for GFF3 (or another file type), the same program can be used for all of its genome loading needs. Once the program has been appropriately customized, loading the annotations/gene models into CoGe is easy:
./gff_annotation_loader.pl -dsid 46765 -gff_file prodigal_models.gff
This runs the program and prints every action it would perform, without actually loading the data into CoGe. To perform the load, add -go 1 to the end of the command:
./gff_annotation_loader.pl -dsid 46765 -gff_file prodigal_models.gff -go 1
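When scripting many loads, it helps to build the dry-run and real commands from the same function so they differ only by the trailing `-go 1`. The sketch below is an illustration rather than part of CoGe: `gff_loader_command` is a hypothetical helper, and it uses only the flags shown in the commands above (`-dsid`, `-gff_file`, `-go`).

```python
def gff_loader_command(dsid, gff_file, go=False,
                       script="./gff_annotation_loader.pl"):
    """Build the argument list for the annotation loader.

    With go=False the command is the dry run shown above; with go=True
    the '-go 1' flag is appended so the data are actually loaded.
    """
    cmd = [script, "-dsid", str(dsid), "-gff_file", gff_file]
    if go:
        cmd += ["-go", "1"]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Running it first with `go=False` and reviewing the output keeps the dry-run step from being skipped.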
Note: An example of the customization for gff_annotation_loader.pl is as follows for processing the GFF3 file produced from Prodigal:
If you make a mistake and need to delete a genome you've loaded, just delete the whole genome (annotations and gene models included), and reload. To delete:
~/projects/CoGeX/scripts/delete_dataset_group.pl -dsgid 11230 -delete_seqs
NOTE: Make sure you get the dataset_group id (dsgid) correct!!! Otherwise, very bad things will happen. Fortunately, there are backups, but contact [1] immediately!