Difference between revisions of "How to load genomes into CoGe"

Revision as of 10:19, 21 February 2011

There are two general programs to run:

fasta_genome_loader.pl:
- Loads FASTA sequences into CoGe
annotation loader:
- Usually some version of gff_annotation_loader.pl, or another program for loading text-based gene models and annotations

UAGC Example

The UAGC (University of Arizona Genetics Core) produces many genomic sequences. This example is meant to help them streamline their procedure for loading genomes into CoGe.

Get 454AllContigs.fna
- This is the usual contig-level genome assembly from the 454 genome sequencing pipeline
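
Before loading, it can help to confirm that the contig file looks like a FASTA file of the expected size. The following is a small sketch and not part of CoGe: fasta_summary is a hypothetical helper that counts header lines and sequence characters with awk.

```shell
#!/bin/sh
# Hypothetical helper (not part of CoGe): summarize a FASTA file.
# Prints the number of sequences and the total number of bases.
fasta_summary() {
    awk '
        /^>/ { n++; next }                 # header line: count one sequence
        {
            gsub(/[ \t\r]/, "")            # drop whitespace and carriage returns
            bases += length($0)            # sequence line: add its length
        }
        END { printf "%d sequences, %d bases\n", n, bases }
    ' "$1"
}
```

For example, `fasta_summary KSJ2_454AllContigs.fna` prints a single summary line that can be compared against the assembly report before the file is loaded.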

First, load the fasta sequence

Run fasta_genome_loader.pl:

~/projects/CoGeX/scripts/fasta_genome_loader.pl \
-org_name "Acidovorax sp. strain JS42 substrain KSJ2" \
-org_desc "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Acidovorax;" \
-source_name "University of Arizona Genetics Core" \
-source_link "http://uagc.arl.arizona.edu/" \
-ds_version .1 \
-nt KSJ2_454AllContigs.fna \
-dsg_restricted 1

Important Notes

CoGe organizes a genome as a collection of datasets (often abbreviated as ds) grouped into a dataset_group (abbreviated as dsg). The general idea is that a genome may consist of multiple files, and we want to track the provenance of each file. If you search for a genome/organism in OrganismView, you'll see that a dsg is listed as a "genome", and that each genome has an associated dsgid.

Option Descriptions

-org_name : the name of the organism
-org_desc : the GenBank taxonomic description of the organism
-source_name : the source of the data
-source_desc (optional) : description of the source of the data
-source_link (optional) : an http:// URL for the place that generated the data (or that owns the data)
-ds_version : version number for the genome
-ds_link (optional) : an http:// URL for the place where the data file was downloaded
-nt : path to the nucleotide sequence file (FASTA)
-dsg_restricted (optional) : make this genome private

Additional Options

-org_id : if the organism is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in OrganismView). This will automatically use the associated name and description.
-source_id : if the data source is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in OrganismView). This will automatically use the associated name, description, and link.
-dsg_name (optional) : specify a name for the genome (dsg). If omitted, it defaults to the name of the organism.
-dsg_desc (optional) : specify a description for the genome (dsg).
-seq_type_id (optional) : specify a different type of sequence for the genome (e.g. masked). By default, unmasked is assumed. You can find a type associated with a genome in OrganismView. Don't fret if you don't know a seq_type_id. You can create a new one (below) or pick from this list of previously created types:

mysql> select * from genomic_sequence_type;
+--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| genomic_sequence_type_id | name                              | description                                                                                                   |
+--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
|                        1 | unmasked                          | unmasked sequence data                                                                                        |
|                        2 | masked repeats 50x                | repeats with more than 50x occurrence have been masked                                                        |
|                        3 | 50X mask +syntenic thread with Os | double masked: 50x repeats and non-coding sequences.  CNS sequences with Os retained                          |
|                        4 | masked repeats 40x                | repeats with more than 40x occurrence have been masked                                                        |
|                        5 | super masked repeats 50x          | repeats with more than 50x occurrence have been masked.  Additional processing was needed for these sequences |
|                        6 | te+kmer masked                    | transposons and kmer hard masked (Bao method)                                                                 |
|                        7 | masked by JGI                     | downloaded masked                                                                                             |
|                        8 | masked by genoscope               | NULL                                                                                                          |
|                        9 | RepeatMasker                      | with MIPS repeat data                                                                                         |
|                       10 | masked by Cacao Genome Database   | NULL                                                                                                          |
|                       11 | Repeat masked by Andrea Zuccolo   | NULL                                                                                                          |
|                       12 | masked by GMGC-nt                 | NULL                                                                                                          |
|                       13 | masked by GMGC                    | NULL                                                                                                          |
+--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+

-seq_type_name (optional) : specify the name of a genomic sequence type
-seq_type_desc (optional) : specify the description of a genomic sequence type
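
When the organism and data source already exist in CoGe, the options above suggest that a load can refer to them by ID instead of repeating names and descriptions. The following is a minimal sketch, not a tested recipe: print_loader_cmd is a hypothetical helper that only prints a command for review, the ID values in the usage line are placeholders, and it assumes that -org_id and -source_id can stand in for the -org_name/-org_desc and -source_name/-source_link options.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a fasta_genome_loader.pl command
# that reuses existing CoGe records by ID. Only flags described above are used.
# Arguments: ORG_ID SOURCE_ID DS_VERSION SEQ_TYPE_ID FASTA
print_loader_cmd() {
    if [ "$#" -ne 5 ]; then
        echo "usage: print_loader_cmd ORG_ID SOURCE_ID DS_VERSION SEQ_TYPE_ID FASTA" >&2
        return 1
    fi
    # Paths containing spaces would need quoting before the output is pasted into a shell.
    printf '%s -org_id %s -source_id %s -ds_version %s -seq_type_id %s -nt %s -dsg_restricted 1\n' \
        '~/projects/CoGeX/scripts/fasta_genome_loader.pl' "$1" "$2" "$3" "$4" "$5"
}
```

For example, `print_loader_cmd 123 45 .1 2 KSJ2_454AllContigs.fna` prints a command that would load the contigs as sequence type 2 ("masked repeats 50x" in the table above); 123 and 45 are placeholder IDs to be replaced with the values shown in OrganismView.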

