Difference between revisions of "How to load genomes into CoGe"

From CoGepedia
Jump to: navigation, search
 
(11 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''NOTE:  You are probably looking for:  [[How to load a private genome into CoGe?  | Loading genomes from CoGe's web interface.]]
+
You can load your own genome sequence, annotation, and quantitative data for use with CoGe's tools.  These data may be kept private and shared with collaborators, or made fully public.
  
'''These instructions are for the command line driven software on CoGe's servers''' and are meant for the administrators of CoGe.
+
#Register for a CyVerse user account if you don't have one: https://user.cyverse.org
 
+
## Add CoGe as a CyVerse Service:
 
+
## [[File:Screen Shot 2017-08-14 at 3.41.40 PM.png|400px]]
 
+
#Log into CoGe: https://genomevolution.org
There are two general programs to run:
+
## [[File:Screen Shot 2020-05-28 at 8.52.36 AM.png|400px]]
*'''fasta_genome_loader.pl''':
+
#Go to your [[User|User Profile Page]] by clicking My Data in the menu bar.  
**Loads in fasta sequences into CoGe
+
## [[File:Screen Shot 2020-05-28 at 8.54.09 AM.png|400px]]  
*annotation loader:
+
#Click '''New''' -> '''New Genome'''
**usually some version of '''gff_annotation_loader.pl''' or some other program for loading text based gene models and annotations
+
##[[File:Screen Shot 2020-05-28 at 8.54.30 AM.png|400px]]
 
+
# Follow [[LoadGenome|this link]] for information on how to use [[LoadGenome]].  
 
+
## Once your genome is loaded into the system, you can add annotation and quantitative data to it using the [[LoadAnnotation]] and [[LoadExperiment]] features.
=UAGC Example=
+
#'''Note:''' Make sure your GFF file is in the correct format for CoGe: [[GFF ingestion]]
The UAGC produces many genomic sequences. This is to help them streamline their procedure for loading genomes into CoGe
+
#If you get an error when accessing the CyVerse Data Store, make sure that CoGe has been added as a service to your CyVerse account: https://user.cyverse.org
#Get 454AllContigs.fna
+
## [[File:Screen Shot 2017-08-14 at 3.41.40 PM.png|400px]]
##This is the usual contig-level genome assembly from the 454 genome sequencing pipeline
+
 
+
==First, load the fasta sequence:  fasta_genome_loader.pl==
+
'''Note:'''Your fasta sequence headers will be used as the chromosome or contig (scaffold, superscaffold, etc) name.  These are parsed to use the first set of all non-whitespace characters. (i.e. everything after the first space in a header name will not be used).  Make sure that these "chromosome" names exactly match the ones used in the gene model/annotation file (e.g. GFF3).  This is how gene models are matched to their associated chromosome of residence.
+
 
+
#run fasta_genome_loader.pl
+
  ~/projects/CoGeX/scripts/load_genomes_n_stuff/fasta_genome_loader.pl \
+
-org_name "Acidovorax sp. strain JS42 substrain KSJ2" \
+
-org_desc "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Acidovorax;" \
+
-source_name "University of Arizona Genetics Core" \
+
-source_link "http://uagc.arl.arizona.edu/" \
+
-ds_version .1 \
+
-nt KSJ2_454AllContigs.fna \
+
-dsg_restricted 1
+
 
+
===Important Notes===
+
*CoGe organisms genomes by a collection of datasets (often abbreviated as '''ds''') into a dataset_group (abbreviated as '''dsg'''). The general idea is that a genome may consist of multiple files, and we want to track the provenance of each file. If you search for a genome/organism in [[OrganismView]], you'll see that dsg is listed as "genome", but that there is an associated dsgid with each genome.
+
 
+
===Option Descriptions===
+
*-org_name : the name of the organism
+
*-org_desc : the GenBank taxanomic description of the organism
+
*-source_name : the source of the data
+
*-source_desc (optional) : description of the source of the data
+
*-source_link (optional) : a http:// url to the the place that generated the data (or who owns the data)
+
*-ds_version : version number for the genome
+
*-ds_link (optional) : a http:// url to link to the place where the data file was downloaded
+
*-nt : path to the nucleotide
+
*-dsg_restricted (optional) : make this genome private
+
 
+
====Additional Options====
+
*-org_id : if the organism is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in [[OrganismView]]).  This will automatically use the associated name and description.
+
*-source_id : if the data source is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in [[OrganismView]]).  This will automatically use the associated name, description, and link.
+
*-dsg_name (optional) : specify a name for the genome (dsg).  If not used, will default to the name of the organism
+
*-dsg_desc (optional) : specify a description for the genome (dsg).
+
*-seq_type_id (optional) : specify a different type of sequence for the genome (e.g. masked). By default, unmasked is assumed.  You can find a type associated with a genome in [[OrganismView]].  Don't fret if you don't know a seq_type_id.  You can create a new one (below) or pick from this list of previously created types: http://genomevolution.org/CoGe/SeqType.pl
+
*-seq_type_name (optional) : specify the name of a genomic sequence type
+
*-seq_type_desc (optional) : specify the description of a genomic sequence type
+
*-use_fasta_header (optional) : uses the entire fasta header line as the chromosome line (might be screwy when using a gene model/annotation file).
+
*-chr (optional) : set chromosome name to something specific. Used for all sequences in the file (makes sense when there is only one sequence in the file).
+
 
+
===After a fasta genome load:===
+
.
+
.
+
.
+
Creating feature of type chromosome
+
Creating feature_name chromosome contig00315 for feat 81329038
+
Adding location contig00315:(1-70, 1)
+
Loading genomic sequence for chromosome: contig00316 (9 nt)
+
Working on chromosome contig00316 of type chromosome
+
Creating feature of type chromosome
+
Creating feature_name chromosome contig00316 for feat 81329039
+
Adding location contig00316:(1-9, 1)
+
      Formatdb running /usr/bin/formatdb -p F -o T -i /opt/apache/CoGe/data/genomic_sequence/0/0/11/11229/11229.faa
+
dataset_id: 46764
+
dataset_group_id: 11229
+
 
+
When fasta_genome_loader.pl is run, you'll see a stream of text letting you know that [[genomic features]] for chromosomes are being created in CoGe, and that names, locations, etc. are being specified for those genomic features.  Note, that each chromosome is itself tracked as a [[genomic feature]] within CoGe. One set of these will be created for each chromosome (contig, etc.) loaded from you fasta file.  When all sequences are loaded, this program will create the blastable databases for the sequence, and the final lines specify the dataset_id and dataset_group_id for the newly loaded genome.  Save these numbers in your notes.  You may need them in the future (but you can always search for them in [[OrganismView]]).
+
 
+
==[[Syntenic path assembly]]==
+
Often, you'll want to take a contig level assembly and generate a complete genome by comparison to a reference genome.  CoGe has support for doing this AND printing out an assembled version of your genome.  This assembled genome may be reloaded (as a higher version) using the method listed above.  Follow this link for information on  [[syntenic path assembly]] in CoGe.
+
 
+
==Annotating the genome with [http://prodigal.ornl.gov/ Prodigal] (from Oak Ridge National Labs)==
+
[http://prodigal.ornl.gov/ Prodigal] is a great program for quickly and easily identifying gene models in bacterial genomes. To annotate a genome using the program (e.g. one on which you just performed a syntenic path assembly), you may run:
+
prodigal -f gff < KSJ2_no-split-syntenic_path_assembly.fna > prodigal_models.gff
+
 
+
==Loading gene models and annotation for a genome (well, in this case, a dataset)==
+
'''Warning:''' There is quite a bit of heterogeneity in genome model and annotation formats; even within the same format (e.g. GFF3).  Thus, there is not a one-use program in CoGe that works for all GFF3 files.  However, there is a general program that is customized for each data source's example implementation of GFF3, and many such versions exist.  Also, the program in CoGe for loading GFF3 data (gff_annotation_loader.pl), has become a bit like frankencode.  Please feel free to contact [mailto:elyons.coge@gmail.com Eric Lyons] for any help/customizations, etc.  Fortunately, once an institute has settled on a consistent format for GFF3 (or other file type), you can use the same program for all your genome loading needs.  Once this program has been appropriately customized, it is very easy to run to load the annotations/gene models into CoGe:
+
 
+
./gff_annotation_loader.pl -dsid 46765 -gff_file prodigal_models.gff
+
 
+
This will run the program, output all the actions it would perform, but not actually load the data into CoGe.  To do the load, add '''-go 1''' to the end of the command:
+
./gff_annotation_loader.pl -dsid 46765 -gff_file prodigal_models.gff -go 1
+
 
+
Note:  An example of the customization for gff_annotation_loader.pl is as follows for processing the GFF3 file produced from Prodigal:
+
*No gene or mRNA models.  These need to be created from the CDS models
+
*Add Prodigal prediction to gene name and annotation
+
 
+
==How to unscrewup==
+
===delete_dataset_group.pl===
+
If you make a mistake and need to delete a genome you've loaded, just delete the whole genome (annotations and gene models included), and reload. To delete:
+
 
+
~/projects/CoGeX/scripts/delete_dataset_group.pl -dsgid 11230 -delete_seqs
+
 
+
===Options for delete_dataset_group.pl===
+
*-dsgid : which dataset_group (genome) to delete
+
*-delete_seqs : delete the sequences from the sequence repository
+
 
+
'''NOTE:''' Make sure you get the dataset_group id (dsgid) correct!!!  Otherwise, very bad things will happen. Fortunately, there are backups, but contact [mailto:elyons.coge@gmail.com] immediately!
+

Latest revision as of 09:38, 28 May 2020

You can load your own genome sequence, annotation, and quantitative data for use with CoGe's tools. These data may be kept private and shared with collaborators, or made fully public.

  1. Register for a CyVerse user account if you don't have one: https://user.cyverse.org
    1. Add CoGe as a CyVerse Service:
    2. Screen Shot 2017-08-14 at 3.41.40 PM.png
  2. Log into CoGe: https://genomevolution.org
    1. Screen Shot 2020-05-28 at 8.52.36 AM.png
  3. Go to your User Profile Page by clicking My Data in the menu bar.
    1. Screen Shot 2020-05-28 at 8.54.09 AM.png
  4. Click New -> New Genome
    1. Screen Shot 2020-05-28 at 8.54.30 AM.png
  5. Follow this link for information on how to use LoadGenome.
    1. Once your genome is loaded into the system, you can add annotation and quantitative data to it using the LoadAnnotation and LoadExperiment features.
  6. Note: Make sure your GFF file is in the correct format for CoGe: GFF ingestion
  7. If you get an error when accessing the CyVerse Data Store, make sure that CoGe has been added as a service to your CyVerse account: https://user.cyverse.org
    1. Screen Shot 2017-08-14 at 3.41.40 PM.png