| There are two general programs to run:
| | You can load your own genome sequence, annotation, and quantitative data for use with CoGe's tools. These data may be kept private and shared with collaborators, or made fully public. |
| *'''fasta_genome_loader.pl''':
| |
| **Loads fasta sequences into CoGe
| |
| *annotation loader:
| |
| **usually some version of '''gff_annotation_loader.pl''' or another program for loading text-based gene models and annotations
| |
|
| |
|
| | | #Register for a CyVerse user account if you don't have one: https://user.cyverse.org |
| =UAGC Example=
| | ## Add CoGe as a CyVerse Service: |
| The UAGC produces many genomic sequences. This example is meant to help them streamline their procedure for loading genomes into CoGe.
| | ## [[File:Screen Shot 2017-08-14 at 3.41.40 PM.png|400px]] |
| #Get 454AllContigs.fna | | #Log into CoGe: https://genomevolution.org |
| ##This is the usual contig-level genome assembly from the 454 genome sequencing pipeline | | ## [[File:Screen Shot 2020-05-28 at 8.52.36 AM.png|400px]] |
| ==First, load the fasta sequence==
| | #Go to your [[User|User Profile Page]] by clicking My Data in the menu bar. |
| '''Note:''' Your fasta sequence headers will be used as the chromosome or contig (scaffold, superscaffold, etc.) names. Each header is parsed to keep only its first run of non-whitespace characters (i.e. everything after the first space in a header is discarded). Make sure that these "chromosome" names exactly match the ones used in the gene model/annotation file (e.g. GFF3), because this is how gene models are matched to the chromosome they reside on.
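The header rule in the note above can be checked before loading. The following is a minimal Python sketch, not part of the CoGe scripts: the function names are hypothetical, the extra header fields (<code>length=</code>, <code>numreads=</code>) and the GFF lines are made-up illustration data, and it assumes the annotation file is tab-delimited GFF3 with the sequence name in the first column.

```python
def fasta_names(lines):
    """Return one name per fasta header: the first run of non-whitespace after '>'."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def gff_seqids(lines):
    """Collect column 1 of every non-blank, non-comment GFF3 line."""
    seqids = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            seqids.add(line.split("\t")[0])
    return seqids


def unmatched_seqids(fasta_lines, gff_lines):
    """Names used in the annotation that have no matching fasta header."""
    return sorted(gff_seqids(gff_lines) - set(fasta_names(fasta_lines)))


# Illustration data only (hypothetical headers and gene models).
fasta = [
    ">contig00315 length=70 numreads=2",
    "ACGT",
    ">contig00316 length=9",
    "ACGTACGTA",
]
gff = [
    "##gff-version 3",
    "contig00315\tUAGC\tgene\t1\t70\t.\t+\t.\tID=gene1",
    "contig315\tUAGC\tgene\t1\t9\t.\t+\t.\tID=gene2",
]

print(fasta_names(fasta))
print(unmatched_seqids(fasta, gff))
```

Any name reported by <code>unmatched_seqids</code> (here the misspelled <code>contig315</code>) would point to a gene model that cannot be matched to a loaded chromosome, so it is worth fixing before running the loaders.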
| | ## [[File:Screen Shot 2020-05-28 at 8.54.09 AM.png|400px]] |
| | | #Click '''New''' -> '''New Genome''' |
| #run fasta_genome_loader.pl | | ##[[File:Screen Shot 2020-05-28 at 8.54.30 AM.png|400px]] |
| ~/projects/CoGeX/scripts/fasta_genome_loader.pl \ | | # Follow [[LoadGenome|this link]] for information on how to use [[LoadGenome]]. |
| -org_name "Acidovorax sp. strain JS42 substrain KSJ2" \
| | ## Once your genome is loaded into the system, you can add annotation and quantitative data to it using the [[LoadAnnotation]] and [[LoadExperiment]] features. |
| -org_desc "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Acidovorax;" \
| | #'''Note:''' Make sure your GFF file is in the correct format for CoGe: [[GFF ingestion]] |
| -source_name "University of Arizona Genetics Core" \
| | #If you get an error when accessing the CyVerse Data Store, make sure that CoGe has been added as a service to your CyVerse account: https://user.cyverse.org |
| -source_link "http://uagc.arl.arizona.edu/" \
| | ## [[File:Screen Shot 2017-08-14 at 3.41.40 PM.png|400px]] |
| -ds_version .1 \
| |
| -nt KSJ2_454AllContigs.fna \
| |
| -dsg_restricted 1
| |
| | |
| ===Important Notes===
| |
| *CoGe organizes genomes as a collection of datasets (often abbreviated as '''ds''') grouped into a dataset_group (abbreviated as '''dsg'''). The general idea is that a genome may consist of multiple files, and we want to track the provenance of each file. If you search for a genome/organism in [[OrganismView]], you'll see that a dsg is listed as a "genome", and that each genome has an associated dsgid.
| |
| | |
| ===Option Descriptions===
| |
| *-org_name : the name of the organism
| |
| *-org_desc : the GenBank taxonomic description of the organism
| |
| *-source_name : the source of the data
| |
| *-source_desc (optional) : description of the source of the data
| |
| *-source_link (optional) : an http:// URL for the place that generated the data (or that owns the data)
| |
| *-ds_version : version number for the genome
| |
| *-ds_link (optional) : an http:// URL for the place where the data file was downloaded
| |
| *-nt : path to the nucleotide fasta file
| |
| *-dsg_restricted (optional) : make this genome private
| |
| | |
| ====Additional Options====
| |
| *-org_id : if the organism is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in [[OrganismView]]). This will automatically use the associated name and description.
| |
| *-source_id : if the data source is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in [[OrganismView]]). This will automatically use the associated name, description, and link.
| |
| *-dsg_name (optional) : specify a name for the genome (dsg). If not used, it defaults to the name of the organism.
| |
| *-dsg_desc (optional) : specify a description for the genome (dsg).
| |
| *-seq_type_id (optional) : specify a different type of sequence for the genome (e.g. masked). By default, unmasked is assumed. You can find a type associated with a genome in [[OrganismView]]. Don't fret if you don't know a seq_type_id. You can create a new one (below) or pick from this list of previously created types:
| |
| mysql> select * from genomic_sequence_type;
| |
| +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| | genomic_sequence_type_id | name | description |
| |
| +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| |
| | 1 | unmasked | unmasked sequence data |
| |
| | 2 | masked repeats 50x | repeats with more than 50x occurrence have been masked |
| |
| | 3 | 50X mask +syntenic thread with Os | double masked: 50x repeats and non-coding sequences. CNS sequences with Os retained |
| |
| | 4 | masked repeats 40x | repeats with more than 40x occurrence have been masked |
| |
| | 5 | super masked repeats 50x | repeats with more than 50x occurrence have been masked. Additional processing was needed for these sequences |
| |
| | 6 | te+kmer masked | transposons and kmer hard masked (Bao method) |
| |
| | 7 | masked by JGI | downloaded masked |
| |
| | 8 | masked by genoscope | NULL |
| |
| | 9 | RepeatMasker | with MIPS repeat data |
| |
| | 10 | masked by Cacao Genome Database | NULL |
| |
| | 11 | Repeat masked by Andrea Zuccolo | NULL |
| |
| | 12 | masked by GMGC-nt | NULL |
| |
| | 13 | masked by GMGC | NULL |
| |
| +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| |
| | |
| *-seq_type_name (optional) : specify the name of a genomic sequence type
| |
| *-seq_type_desc (optional) : specify the description of a genomic sequence type
| |
| *-use_fasta_header (optional) : uses the entire fasta header line as the chromosome name (this may cause mismatches when using a gene model/annotation file).
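When many genomes are loaded in a row, the option list above can be assembled programmatically. The following Python sketch is only an illustration: the helper names are hypothetical, and the check assumes that options not marked optional are required and that -org_id and -source_id can stand in for the organism and source name options, which is a reading of the descriptions above rather than confirmed behavior of fasta_genome_loader.pl.

```python
import shlex


def missing_options(options):
    """List options that appear to be required but are absent (assumption, see lead-in)."""
    missing = []
    if "org_id" not in options:
        missing += [n for n in ("org_name", "org_desc") if n not in options]
    if "source_id" not in options and "source_name" not in options:
        missing.append("source_name")
    missing += [n for n in ("ds_version", "nt") if n not in options]
    return missing


def build_loader_command(script, options):
    """Return an argument list of the form [script, '-name', 'value', ...]."""
    missing = missing_options(options)
    if missing:
        raise ValueError("missing options: " + ", ".join(missing))
    argv = [script]
    for name, value in options.items():
        argv += ["-" + name, str(value)]
    return argv


# Values copied from the UAGC example command.
options = {
    "org_name": "Acidovorax sp. strain JS42 substrain KSJ2",
    "org_desc": "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Acidovorax;",
    "source_name": "University of Arizona Genetics Core",
    "source_link": "http://uagc.arl.arizona.edu/",
    "ds_version": ".1",
    "nt": "KSJ2_454AllContigs.fna",
    "dsg_restricted": 1,
}
argv = build_loader_command("fasta_genome_loader.pl", options)

# shlex.join (Python 3.8+) quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list to <code>subprocess.run(argv)</code> instead of printing it would avoid shell quoting problems with the semicolons in -org_desc.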
| |
| | |
| ===After a fasta genome load:===
| |
| .
| |
| .
| |
| .
| |
| Creating feature of type chromosome
| |
| Creating feature_name chromosome contig00315 for feat 81329038
| |
| Adding location contig00315:(1-70, 1)
| |
| Loading genomic sequence for chromosome: contig00316 (9 nt)
| |
| Working on chromosome contig00316 of type chromosome
| |
| Creating feature of type chromosome
| |
| Creating feature_name chromosome contig00316 for feat 81329039
| |
| Adding location contig00316:(1-9, 1)
| |
| Formatdb running /usr/bin/formatdb -p F -o T -i /opt/apache/CoGe/data/genomic_sequence/0/0/11/11229/11229.faa
| |
| dataset_id: 46764
| |
| dataset_group_id: 11229
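The last two lines of the log report the IDs assigned to the new dataset and dataset_group (the genome's dsgid). A small Python sketch, assuming the loader output was saved as text and ends with lines in the format shown above, can pull them out; the function name is hypothetical.

```python
import re


def parse_load_ids(log_text):
    """Extract dataset_id and dataset_group_id from the loader output as integers."""
    pattern = r"^\s*(dataset_id|dataset_group_id):\s*(\d+)"
    ids = {}
    for key, value in re.findall(pattern, log_text, flags=re.MULTILINE):
        ids[key] = int(value)
    return ids


# Tail of the example log above.
log_tail = """Adding location contig00316:(1-9, 1)
Formatdb running /usr/bin/formatdb -p F -o T -i /opt/apache/CoGe/data/genomic_sequence/0/0/11/11229/11229.faa
dataset_id: 46764
dataset_group_id: 11229
"""

ids = parse_load_ids(log_tail)
print(ids)
```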
| |