|
|
(22 intermediate revisions by the same user not shown) |
Line 1: |
Line 1: |
− | There are two general programs to run:
| + | You can load your own genome sequence, annotation, and quantitative data for use with CoGe's tools. These data may be kept private and shared with collaborators, or made fully public. |
− | *'''fasta_genome_loader.pl''':
| + | |
− | **Loads FASTA sequences into CoGe
| + | |
− | *annotation loader:
| + | |
− | **Usually some version of '''gff_annotation_loader.pl''', or another program for loading text-based gene models and annotations
| + | |
| | | |
− | | + | #Register for a CyVerse user account if you don't have one: https://user.cyverse.org |
− | ==UAGC Example==
| + | ## Add CoGe as a CyVerse Service: |
− | The UAGC produces many genomic sequences. This example is meant to help them streamline their procedure for loading genomes into CoGe.
| + | ## [[File:Screen Shot 2017-08-14 at 3.41.40 PM.png|400px]] |
− | #Get 454AllContigs.fna | + | #Log into CoGe: https://genomevolution.org |
− | ##This is the usual contig-level genome assembly from the 454 genome sequencing pipeline | + | ## [[File:Screen Shot 2020-05-28 at 8.52.36 AM.png|400px]] |
− | ===First, load the fasta sequence===
| + | #Go to your [[User|User Profile Page]] by clicking My Data in the menu bar. |
− | '''Note:''' Your FASTA sequence headers will be used as the chromosome or contig (scaffold, superscaffold, etc.) names. Each header is parsed to keep only the first run of non-whitespace characters (i.e., everything after the first space in a header is ignored). Make sure that these "chromosome" names exactly match the ones used in the gene model/annotation file (e.g. GFF3), because this is how gene models are matched to the chromosome they reside on.
| + | ## [[File:Screen Shot 2020-05-28 at 8.54.09 AM.png|400px]] |
− | | + | #Click '''New''' -> '''New Genome''' |
− | #run fasta_genome_loader.pl | + | ##[[File:Screen Shot 2020-05-28 at 8.54.30 AM.png|400px]] |
− | ~/projects/CoGeX/scripts/fasta_genome_loader.pl \ | + | # Follow [[LoadGenome|this link]] for information on how to use [[LoadGenome]]. |
− | -org_name "Acidovorax sp. strain JS42 substrain KSJ2" \
| + | ## Once your genome is loaded into the system, you can add annotation and quantitative data to it using the [[LoadAnnotation]] and [[LoadExperiment]] features. |
− | -org_desc "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Acidovorax;" \
| + | #'''Note:''' Make sure your GFF file is in the correct format for CoGe: [[GFF ingestion]] |
− | -source_name "University of Arizona Genetics Core" \
| + | #If you get an error when accessing the CyVerse Data Store, make sure that CoGe has been added as a service to your CyVerse account: https://user.cyverse.org |
− | -source_link "http://uagc.arl.arizona.edu/" \
| + | ## [[File:Screen Shot 2017-08-14 at 3.41.40 PM.png|400px]] |
− | -ds_version .1 \
| + | |
− | -nt KSJ2_454AllContigs.fna \
| + | |
− | -dsg_restricted 1
| + | |
− | | + | |
− | ===Important Notes===
| + | |
− | *CoGe organizes genomes as a collection of datasets (often abbreviated as '''ds''') grouped into a dataset_group (abbreviated as '''dsg'''). The general idea is that a genome may consist of multiple files, and we want to track the provenance of each file. If you search for a genome/organism in [[OrganismView]], you'll see that a dsg is listed as "genome", and that each genome has an associated dsgid.
| + | |
− | | + | |
− | ===Option Descriptions===
| + | |
− | *-org_name : the name of the organism
| + | |
− | *-org_desc : the GenBank taxonomic description of the organism
| + | |
− | *-source_name : the source of the data
| + | |
− | *-source_desc (optional) : description of the source of the data
| + | |
− | *-source_link (optional) : an http:// URL for the place that generated the data (or that owns the data)
| + | |
− | *-ds_version : version number for the genome
| + | |
− | *-ds_link (optional) : an http:// URL for the place where the data file was downloaded
| + | |
− | *-nt : path to the nucleotide sequence (FASTA) file
| + | |
− | *-dsg_restricted (optional) : make this genome private
| + | |
− | | + | |
− | ====Additional Options====
| + | |
− | *-org_id : if the organism is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in [[OrganismView]]). This will automatically use the associated name and description.
| + | |
− | *-source_id : if the data source is already entered into CoGe, you can use its internal CoGe ID (available by searching for the organism in [[OrganismView]]). This will automatically use the associated name, description, and link.
| + | |
− | *-dsg_name (optional) : specify a name for the genome (dsg). If not given, it defaults to the name of the organism.
| + | |
− | *-dsg_desc (optional) : specify a description for the genome (dsg).
| + | |
− | *-seq_type_id (optional) : specify a different type of sequence for the genome (e.g. masked). By default, unmasked is assumed. You can find a type associated with a genome in [[OrganismView]]. Don't fret if you don't know a seq_type_id. You can create a new one (below) or pick from this list of previously created types:
| + | |
− | mysql> select * from genomic_sequence_type; | + | |
− | +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| + | |
− | | genomic_sequence_type_id | name | description |
| + | |
− | +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| + | |
− | | 1 | unmasked | unmasked sequence data |
| + | |
− | | 2 | masked repeats 50x | repeats with more than 50x occurrence have been masked |
| + | |
− | | 3 | 50X mask +syntenic thread with Os | double masked: 50x repeats and non-coding sequences. CNS sequences with Os retained |
| + | |
− | | 4 | masked repeats 40x | repeats with more than 40x occurrence have been masked |
| + | |
− | | 5 | super masked repeats 50x | repeats with more than 50x occurrence have been masked. Additional processing was needed for these sequences |
| + | |
− | | 6 | te+kmer masked | transposons and kmer hard masked (Bao method) |
| + | |
− | | 7 | masked by JGI | downloaded masked |
| + | |
− | | 8 | masked by genoscope | NULL |
| + | |
− | | 9 | RepeatMasker | with MIPS repeat data |
| + | |
− | | 10 | masked by Cacao Genome Database | NULL |
| + | |
− | | 11 | Repeat masked by Andrea Zuccolo | NULL |
| + | |
− | | 12 | masked by GMGC-nt | NULL |
| + | |
− | | 13 | masked by GMGC | NULL |
| + | |
− | +--------------------------+-----------------------------------+---------------------------------------------------------------------------------------------------------------+
| + | |
− | | + | |
− | *-seq_type_name (optional) : specify the name of a genomic sequence type
| + | |
− | *-seq_type_desc (optional) : specify the description of a genomic sequence type
| + | |
− | *-use_fasta_header (optional) : uses the entire FASTA header line as the chromosome name (this may cause mismatches with the names in a gene model/annotation file).
| + | |
You can load your own genome sequence, annotation, and quantitative data for use with CoGe's tools. These data may be kept private and shared with collaborators, or made fully public.
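Before loading a genome and its annotation, it can help to confirm that the sequence names in the GFF file match the chromosome names that will be derived from the FASTA headers. The following is a minimal sketch of such a check, assuming the chromosome name is the first whitespace-delimited token of each FASTA header and that the first tab-separated GFF column holds the sequence name. It is not part of CoGe; the function names (<code>fasta_names</code>, <code>gff_seqids</code>, <code>unmatched_seqids</code>) and the sample data are hypothetical.

```python
import io


def fasta_names(handle):
    """Return chromosome names: the first whitespace-delimited token of each '>' header."""
    names = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def gff_seqids(handle):
    """Return the set of sequence names found in column 1 of a GFF file."""
    seqids = set()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        seqids.add(line.split("\t", 1)[0].strip())
    return seqids


def unmatched_seqids(fasta_handle, gff_handle):
    """Return GFF sequence names that have no matching FASTA chromosome name."""
    known = set(fasta_names(fasta_handle))
    return sorted(gff_seqids(gff_handle) - known)


if __name__ == "__main__":
    # Hypothetical sample data, used only to illustrate the check.
    fasta = io.StringIO(">contig00001 length=1200 numreads=40\nACGT\n>contig00002\nGGCC\n")
    gff = io.StringIO(
        "##gff-version 3\n"
        "contig00001\tUAGC\tgene\t1\t100\t.\t+\t.\tID=g1\n"
        "contig3\tUAGC\tgene\t5\t50\t.\t-\t.\tID=g2\n"
    )
    print(unmatched_seqids(fasta, gff))
```

To check real files, pass open file handles instead of the in-memory samples. Any name the check reports is a GFF sequence name that would not match a chromosome in the loaded genome.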