Difference between revisions of "LoadGenome"

From CoGepedia

Latest revision as of 12:22, 3 February 2017

The LoadGenome page lets you add your own genome for use with CoGe's various tools. Once your genome is loaded, you can add annotation and experiment data.

Important Information

  • You must be logged into CoGe to use LoadGenome.
  • Your genome will be automatically assigned to your user account.
  • You can share your genome with other CoGe users or with the public (by default, only you can see it).
  • The input FASTA files must meet certain requirements (see the Data Format section below).
  • Please see CoGe's Data policy.

Quick Start

  1. Organism: Select an organism for your genome. If your organism is not listed, please create it by clicking New.
  2. Version: The version of your genome (please use numbers). Example: 0.01
  3. Type: The kind of sequence (most are unmasked).
  4. Source: Where did you get your data?  This could be you, your lab, your university, a sequencing center, your collaborator.  Please give credit and provide a link to the source.
  5. Restricted:
    1. If this is selected, only you have access to the data after it is loaded.
    2. If this is not selected, your data is fully public.
    3. Note: you can change this at any point (make it public, make it private, or share with certain users ... see User for instructions).
  6. Add data files: You can specify multiple FASTA data files for your genome. Only properly formatted FASTA files are allowed (see Data Format section below for details). At a later time GFF annotation files can be loaded for the genome.
    1. iPlant Data Store: Very fast, and it can transfer data files larger than 2GB. See below for more info.
    2. FTP/HTTP: May be slow, but useful if the data is already posted somewhere. Note that HTTP usually doesn't work for files larger than 2GB.
    3. NCBI: Specify GenBank accessions for a genome. Note: genomes retrieved from NCBI are made fully public.
    4. Upload: Upload from your computer. Note: this usually doesn't work for files larger than 2GB.
  7. Click Load Genome when everything is ready!

iPlant Data Store

We prefer to use iPlant's Data Store to transfer your data to CoGe because it is easy to use (like Dropbox), has a lot of free storage for scientists, and is very fast. There are two ways to access the Data Store from CoGe:

  1. Quick-share link
  2. The coge_data/ folder

Quick-share link

Data can be imported into CoGe using a Data Store quick-share link. Follow the instructions below to generate a quick-share link, then paste the link into the URL field under "FTP/HTTP" data source in LoadGenome.

  1. Upload FASTA and GFF (if available) files to the iPlant Data Store
    1. Quick Start Guide: https://pods.iplantcollaborative.org/wiki/display/start/Data+Store+Quick+Start
    2. OBSOLETE: Use Davis to generate a quick-share link to let others download the data: (link to old video: http://www.youtube.com/watch?v=CoHjYWSvrPA)

The coge_data/ folder

CoGe has default access to any files in the folder named coge_data/ in your Data Store. Simply place your data files into that directory using iCommands or the Discovery Environment, then select the files under "Data Store" in LoadGenome.

Data Format

The input files must be in standard FASTA format and meet the following requirements:

  1. Note that leading "chr", "lcl", and "gi" prefixes will be removed from each section name.
  2. Section names must be unique.
  3. End-of-line characters must be Unix-style, not Windows-style.
  4. Individual input files can be in a compressed format (.gz, .bz2). Multiple input files can be archived (.tar, .tar.gz, .zip).

The genome load will fail with an error message if the FASTA file(s) do not meet the conditions above.
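
Because a failed load costs an upload, it can help to check a FASTA file locally first. The following is a minimal sketch, not CoGe's own validator: the function name check_fasta is hypothetical, the section name is assumed to be the first word of each header line, and the prefix pattern is only a guess at how "chr", "lcl", and "gi" prefixes are stripped (CoGe's exact rule is not documented on this page). It flags Windows-style line endings and section names that collide after the assumed prefix removal.

```python
import re
from collections import Counter
from typing import List

# Assumption: this pattern is a guess at CoGe's prefix stripping. It removes a
# leading "chr", "lcl", or "gi", plus an optional "|" or "_" separator.
PREFIX = re.compile(r"^(chr|lcl|gi)[|_]?", re.IGNORECASE)


def check_fasta(data: bytes) -> List[str]:
    """Return a list of problems found in raw FASTA bytes (empty if none)."""
    problems = []

    # Windows-style line endings contain a carriage return.
    if b"\r" in data:
        problems.append("Windows-style (CR) line endings found")

    names = []
    for line in data.splitlines():
        if line.startswith(b">"):
            # Assumption: the section name is the first word of the header.
            fields = line[1:].decode("ascii", errors="replace").split()
            name = fields[0] if fields else ""
            names.append(PREFIX.sub("", name))

    if not names:
        problems.append("no FASTA header lines (starting with '>') found")

    # Names must still be unique after the assumed prefix removal,
    # e.g. "chr1" and "1" would both become "1".
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"section name {name!r} appears {count} times")

    return problems
```

For an uncompressed file, something like check_fasta(open("genome.fasta", "rb").read()) would run the checks, where genome.fasta is a placeholder file name. This reads the whole file into memory, so a very large genome would need a streaming variant.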

Example data for EPIC-CoGe: example datasets for loading your own genome (plus other genomic data)