|
|
(25 intermediate revisions by the same user not shown) |
Line 1: |
Line 1: |
| There are two general programs to run:
| | You can load your own genome sequence, annotation, and quantitative data for use with CoGe's tools. These data may be kept private and shared with collaborators, or made fully public. |
| *'''fasta_genome_loader.pl''':
| |
| **Loads FASTA sequences into CoGe
| |
| *annotation loader:
| |
| **Usually some version of '''gff_annotation_loader.pl''', or another program for loading text-based gene models and annotations
| |
|
| |
|
| | | #Register for a CyVerse user account if you don't have one: https://user.cyverse.org |
| ==UAGC Example==
| | ## Add CoGe as a CyVerse Service: |
| The UAGC produces many genomic sequences. This example is meant to help them streamline their procedure for loading genomes into CoGe.
| | ## [[File:Screen Shot 2017-08-14 at 3.41.40 PM.png|400px]] |
| #Get 454AllContigs.fna | | #Log into CoGe: https://genomevolution.org |
| ##This is the usual contig-level genome assembly from the 454 genome sequencing pipeline | | ## [[File:Screen Shot 2020-05-28 at 8.52.36 AM.png|400px]] |
| #run fasta_genome_loader.pl | | #Go to your [[User|User Profile Page]] by clicking My Data in the menu bar. |
| ~/projects/CoGeX/scripts/fasta_genome_loader.pl \ | | ## [[File:Screen Shot 2020-05-28 at 8.54.09 AM.png|400px]] |
| -org_name "Acidovorax sp. strain JS42 substrain KSJ2" \
| | #Click '''New''' -> '''New Genome''' |
| -org_desc "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Acidovorax;" \
| | ##[[File:Screen Shot 2020-05-28 at 8.54.30 AM.png|400px]] |
| -source_name "University of Arizona Genetics Core" \
| | # Follow [[LoadGenome|this link]] for information on how to use [[LoadGenome]]. |
| -source_link "http://uagc.arl.arizona.edu/" \
| | ## Once your genome is loaded into the system, you can add annotation and quantitative data to it using the [[LoadAnnotation]] and [[LoadExperiment]] features. |
| -ds_version .1 \
| | #'''Note:''' Make sure your GFF file is in the correct format for CoGe: [[GFF ingestion]] |
| -nt KSJ2_454AllContigs.fna \
| | #If you get an error when accessing the CyVerse Data Store, make sure that CoGe has been added as a service to your CyVerse account: https://user.cyverse.org |
| -dsg_restricted 1
| | ## [[File:Screen Shot 2017-08-14 at 3.41.40 PM.png|400px]] |
| | |
| ===Important Notes===
| |
| *CoGe organizes a genome as a collection of datasets (often abbreviated as '''ds''') grouped into a dataset_group (abbreviated as '''dsg'''). The general idea is that a genome may consist of multiple files, and we want to track the provenance of each file. If you search for a genome/organism in [[OrganismView]], you'll see that a dsg is listed as "genome", and that each genome has an associated dsgid.
| |
| | |
| ===Option Descriptions===
| |
| *-org_name : the name of the organism
| |
| *-org_desc : the GenBank taxonomic description of the organism
| |
| *-source_name : the source of the data
| |
| *-source_desc (optional) : description of the source of the data
| |
| *-source_link (optional) : an http:// URL for the place that generated the data (or that owns the data)
| |
| *-ds_version : version number for the genome
| |
| *-ds_link (optional) : an http:// URL for the place where the data file was downloaded
| |
| *-nt : path to the nucleotide sequence file in FASTA format (e.g. KSJ2_454AllContigs.fna)
| |
| *-dsg_restricted (optional) : make this genome private
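Before passing a file to -nt, it can help to confirm that the file is readable and contains the number of contigs you expect. The following is a minimal sketch of such a pre-flight check, not part of CoGe itself; the function name count_fasta_records is hypothetical, and it assumes the usual FASTA convention that each record starts with a header line beginning with ">".

```shell
# Hypothetical helper (not part of CoGe): count the records in a FASTA file
# by counting header lines, i.e. lines that begin with '>'.
count_fasta_records() {
    if [ ! -r "$1" ]; then
        echo "cannot read FASTA file: $1" >&2
        return 1
    fi
    # grep -c prints 0 and exits non-zero when no line matches;
    # keep the printed count and ignore that exit status.
    grep -c '^>' "$1" || true
}
```

For example, running count_fasta_records KSJ2_454AllContigs.fna prints the number of contigs in the assembly file used in the example above, so a truncated or empty download can be caught before the loader is run.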
| |
| | |
| ====Additional Options====
| |
| *-org_id : if the organism is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in [[OrganismView]]). This will automatically use the associated name and description.
| |
| *-source_id : if the data source is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in [[OrganismView]]). This will automatically use the associated name, description, and link.
| |
| *-dsg_name (optional) : specify a name for the genome (dsg). If not used, the name defaults to the name of the organism.
| |
| *-dsg_desc (optional) : specify a description for the genome (dsg).
| |
| *-seq_type_id (optional) : specify a different type of sequence for the genome (e.g. masked). By default, unmasked is assumed. Available types:
| |
| mysql> select * from genomic_sequence_type;
| |
| +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| |
| | genomic_sequence_type_id | name | description |
| |
| +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| |
| | 1 | unmasked | unmasked sequence data |
| |
| | 2 | masked repeats 50x | repeats with more than 50x occurrence have been masked |
| |
| | 3 | 50X mask +syntenic thread with Os | double masked: 50x repeats and non-coding sequences. CNS sequences with Os retained |
| |
| | 4 | masked repeats 40x | repeats with more than 40x occurrence have been masked |
| |
| | 5 | super masked repeats 50x | repeats with more than 50x occurrence have been masked. Additional processing was needed for these sequences |
| |
| | 6 | te+kmer masked | transposons and kmer hard masked (Bao method) |
| |
| | 7 | masked by JGI | downloaded masked |
| |
| | 8 | masked by genoscope | NULL |
| |
| | 9 | RepeatMasker | with MIPS repeat data |
| |
| | 10 | masked by Cacao Genome Database | NULL |
| |
| | 11 | Repeat masked by Andrea Zuccolo | NULL |
| |
| | 12 | masked by GMGC-nt | NULL |
| |
| | 13 | masked by GMGC | NULL |
| |
| +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| |
| | |
| *-seq_type_name (optional) : specify
| |
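When the organism and data source already exist in CoGe, the -org_id, -source_id, and -seq_type_id options described above can stand in for the name and description options. The sketch below only prints the resulting command so it can be reviewed before anything is loaded. The function name print_loader_cmd and the LOADER variable are hypothetical, the script path is taken from the UAGC example, and whether -org_id and -source_id fully replace the name options should be checked against the script itself. Values are not quoted in the printed line, so paths containing spaces would need extra care.

```shell
# Path to the loader script as used in the UAGC example; override LOADER if yours differs.
LOADER="${LOADER:-$HOME/projects/CoGeX/scripts/fasta_genome_loader.pl}"

# Hypothetical helper: print (do not run) a loader command that reuses existing CoGe IDs.
# Arguments: org_id source_id seq_type_id ds_version fasta_path
print_loader_cmd() {
    printf '%s -org_id %s -source_id %s -seq_type_id %s -ds_version %s -nt %s -dsg_restricted 1\n' \
        "$LOADER" "$1" "$2" "$3" "$4" "$5"
}
```

For example, print_loader_cmd 123 45 1 .1 KSJ2_454AllContigs.fna prints a command for a private, unmasked genome (sequence type 1 in the table above). The IDs 123 and 45 are placeholders; look up the real values in [[OrganismView]].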