Difference between revisions of "LoadExperiment"

From CoGepedia

Revision as of 14:16, 4 December 2015

LoadExperiment enables you to load experimental quantitative, polymorphism, or alignment data for a genome in CoGe. Many common file formats are supported. The data can then be viewed alongside genome sequence/annotation in GenomeView. The LoadExperiment "wizard" guides you through each step:

  1. Describe your experiment
  2. Add data
  3. Select Options
  4. Review and Submit


Metadata: describe your experiment

The first step in loading an experiment into CoGe is to describe it with metadata. CoGe requires a minimum set of metadata fields:

Enter metadata that describes the experiment
  • Name: Name of experiment
  • Description: Description of experiment
  • Version: Version of experiment
  • Source: Where is the data from? This could be you, your lab, your university, a sequencing center, your collaborator.
  • Restricted: Is this experiment public or restricted to you and your collaborators?
  • Genome: Select the appropriate genome from CoGe to associate the file with

Add data

Select data files

You can select and retrieve data files located at:

  • The iPlant Data Store
    • CoGe can only read data from the 'coge_data' directory in your iPlant Data Store. You must put your data into that directory.
  • An FTP server
  • Your computer (Upload)
    • Not recommended for large files!

Note: Typically only a single data file can be selected. The exception is FASTQ formatted files for which multiple files can be selected.

Data Formats and Track Types

LoadExperiment supports several data file formats depending on the data type:

  • Quantitative data
    Quantitative track
    • Comma-separated (CSV) file format
    • Tab-separated (TSV) file format
    • BED file format
    • WIG file format
  • Marker data
    Marker track
    • GFF/GTF file format
  • Polymorphism (SNP) data
    SNP track
    • Variant Call Format (VCF) file format
  • Alignment data
    Alignment track
    • BAM file format
    • FASTQ file format

Each of these file formats is described below in its own section. The file type can be auto-detected by LoadExperiment if the file name ends with the expected extension (.csv, .tsv, .bed, .gff, .gtf, .vcf, .bam). Files can be compressed (.zip, .gz) and still have their type auto-detected (e.g., mydata.bed.gz). For non-standard file name extensions, you can force a specific type by selecting it from the list of file types.
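
The extension rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not CoGe's actual implementation; the function name `detect_file_type` and the type labels are hypothetical.

```python
from pathlib import PurePath

# Extensions listed in the text above; the type labels are illustrative only.
KNOWN_TYPES = {".csv": "csv", ".tsv": "tsv", ".bed": "bed", ".gff": "gff",
               ".gtf": "gtf", ".vcf": "vcf", ".bam": "bam"}
COMPRESSION = {".zip", ".gz"}

def detect_file_type(filename):
    """Return a type label for filename, or None if it cannot be detected."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    # Ignore a trailing compression extension, e.g. mydata.bed.gz -> .bed
    if suffixes and suffixes[-1] in COMPRESSION:
        suffixes.pop()
    if not suffixes:
        return None
    return KNOWN_TYPES.get(suffixes[-1])
```

A result of None corresponds to the case where you would need to pick the type manually from the list of file types.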

CSV File Format

This is a comma-delimited file that contains the following columns:

  • Chromosome (string)
  • Start position (integer)
  • Stop position (integer)
  • Chromosome Strand (1 or -1)
  • Measurement Value (real number): must be between 0 and 1, inclusive
  • Second Value (OPTIONAL): can store a second value such as an expect value (real number)
#CHR,START,STOP,STRAND,VALUE1(0-1),VALUE2(ANY-ANY)
1,11486,12316,1,0.181430277220112,7.3980806218146
1,27309,28272,1,0.944373742485446,5.08225285439412
1,32484,32978,1,0.328500324191726,1.97719838086201
1,41942,42508,-1,0.825027233105203,6.56057592312617
1,56394,57527,-1,0.183234367788511,0.795527328556531
1,67705,68809,-1,0.956523086778851,5.20992343466606
1,71144,72409,1,0.42955128220331,1.80604269639474
1,81671,82833,1,0.626003507696723,2.77834108023821
1,86467,87623,-1,0.0878653961575928,7.42843749315945
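
Before uploading, it can help to check a file against the column rules above. The following is a minimal sketch of such a check, not the validation CoGe itself performs; the function name `validate_csv_rows` is hypothetical, and skipping lines that start with "#" is an assumption based on the header line in the example.

```python
import csv
import io

def validate_csv_rows(text):
    """Yield (line_number, message) for each row that breaks the column rules."""
    for line_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header comment (assumption)
        if len(row) not in (5, 6):
            yield line_number, "expected 5 or 6 columns"
            continue
        chrom, start, stop, strand, value = row[:5]
        try:
            start, stop, value = int(start), int(stop), float(value)
            if len(row) == 6:
                float(row[5])  # the optional second value must be a real number
        except ValueError:
            yield line_number, "non-numeric position or value"
            continue
        if strand not in ("1", "-1"):
            yield line_number, "strand must be 1 or -1"
        elif not 0.0 <= value <= 1.0:
            yield line_number, "measurement value must be in [0, 1]"
```

An empty result means every data row has the expected shape. For TSV input, the same check should work by passing `delimiter="\t"` to `csv.reader`.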

TSV File Format

Same as CSV format but with tab delimiters instead of commas.

BED File Format

Two types are supported:

Note that the 0-based coordinates of the BED format will be translated into 1-based within CoGe. Measurement values must be between [0,1].
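
As an illustration of the coordinate translation, the sketch below applies the conventional conversion from BED intervals (0-based start, exclusive end) to 1-based inclusive coordinates. That CoGe uses exactly this convention is an assumption, and the function name `bed_to_one_based` is hypothetical.

```python
def bed_to_one_based(chrom_start, chrom_end):
    """Convert a BED interval (0-based, end-exclusive) to 1-based, end-inclusive."""
    # Only the start shifts: the BED end is exclusive, so it already equals
    # the 1-based inclusive end.
    return chrom_start + 1, chrom_end
```

For example, the first 100 bases of a chromosome are written as 0 and 100 in BED and become 1 and 100 after conversion.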

WIG File Format

Standard WIG format as defined here: http://genome.ucsc.edu/goldenpath/help/wiggle.html

Only the "variableStep" type is supported. The "fixedStep" type is not supported and will cause the load to fail with an error.

Start and end coordinates are expected to be 1-based. Measurement values must be between [0,1].
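
The WIG constraints above can be checked ahead of a load with a small parser. This is a rough sketch assuming the UCSC variableStep layout (a declaration line with `chrom=` and an optional `span=`, followed by "position value" lines); the function name `parse_variable_step` is hypothetical and this is not CoGe's own loader.

```python
def parse_variable_step(lines):
    """Yield (chrom, start, end, value) tuples from variableStep WIG lines."""
    chrom, span = None, 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank, comment, and display lines
        if line.startswith("fixedStep"):
            raise ValueError("fixedStep is not supported")
        if line.startswith("variableStep"):
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom, span = fields["chrom"], int(fields.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line before variableStep declaration")
        position, value = line.split()
        value = float(value)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"value {value} outside [0, 1]")
        start = int(position)  # WIG positions are already 1-based
        yield chrom, start, start + span - 1, value
```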

GFF File Format

Standard GFF3 format as defined here: http://gmod.org/wiki/GFF3

Only the seqid, start, end, score, strand, and attribute columns are used (column numbers 1, 4, 5, 6, 7, 9 respectively).
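
To make the column selection concrete, here is a minimal sketch that pulls the six used columns out of one GFF3 feature line. The function name `gff_fields` and the returned dictionary layout are hypothetical, and the attribute parsing is simplified (no URL-decoding of escaped characters).

```python
def gff_fields(line):
    """Return the six columns the text says are used, from one GFF3 line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has 9 tab-separated columns")
    # Columns 2 (source), 3 (type), and 8 (phase) are read but not kept.
    seqid, _source, _type, start, end, score, strand, _phase, attributes = cols
    return {
        "seqid": seqid,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "attributes": dict(
            pair.split("=", 1) for pair in attributes.split(";") if "=" in pair
        ),
    }
```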

VCF File Format

Standard VCF 4.1 format as defined here: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

In short, it is a tab-separated file that contains the following column headings:

  1. Chromosome - name of the scaffold, pseudomolecule, or chromosome the SNP call is located on
  2. Position - where the SNP is called on the chromosome in bp
  3. ID - a semicolon-separated list of unique identifiers for the SNP. If there is no identifier, a period (".") should be recorded. Identifiers must not contain white space or semicolons.
  4. Reference - the base that exists on the reference genome (A,G,C,T,N). Multiple nucleotides are shown for indels or complex substitutions.
  5. Alt - comma separated list of nucleotides that are different in an individual from the reference (A,G,C,T,N,*). The asterisk denotes a deletion.
  6. Quality - Phred quality score for the alternate nucleotide call.
  7. Metadata information lines
    1. Filter
    2. Info -
    3. Format -
  8. Individuals - can be any number of columns describing each individual in the population of interest

VCF for population genetics calculations

Chromosome      Position ID      Reference Alt     Quality Filter  INFO    Format          Indiv1                   Indiv2
scaffold_1      114      .       A         .       52.8    .       .       GT:DP           0/0:23                   0/0:19  
scaffold_1      116      .       C         T       151.24  .       .       GT:AD:DP:GQ:PL  0/0:23,0:23:54:0,54,392  0/0:17,0:19:24:0,24,231 
scaffold_1      117      .       C         A       704.91  .       .       GT:AD:DP:GQ:PL  0/0:25,0:25:57:0,57,413  0/0:18,2:20:8:0,8,193   
scaffold_1      118      .       T         .       52.83   .       .       GT:DP           0/0:25                   0/0:20  
scaffold_1      119      .       A         .       14.95   LowQual .       GT:DP           0/0:26                   0/0:20  
scaffold_1      120      .       T         G       250.44  .       .       GT:AD:DP:GQ:PL  0/1:16,3:19:8:8,0,72     0/1:29,10:40:77:77,0,318 
scaffold_1      121      .       A         T       49.94   .       .       GT:AD:DP:GQ:PL  0/0:39,0:39:63:0,63,473  0/0:18,0:18:6:0,6,43    
scaffold_1      122      .       C         .       15.89   LowQual .       GT:DP           0/0:28                   0/0:20  
scaffold_1      123      .       C         .       49.89   .       .       GT:DP           0/0:29                   0/0:20
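
The column layout above can be read with a short parser. The sketch below assumes a real tab-separated data line (the example above is shown with aligned spacing for readability) and treats "." as a missing value; the function name `parse_vcf_line` and the output structure are hypothetical.

```python
def parse_vcf_line(line, sample_names):
    """Split one tab-separated VCF data line into its described columns."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = cols[:9]
    keys = fmt.split(":")  # e.g. GT:AD:DP:GQ:PL
    # Pair each individual's colon-separated values with the Format keys.
    samples = {
        name: dict(zip(keys, field.split(":")))
        for name, field in zip(sample_names, cols[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ids": [] if vid == "." else vid.split(";"),
        "ref": ref,
        "alt": [] if alt == "." else alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info,
        "samples": samples,
    }
```

Applied to the position 120 row of the example with sample names Indiv1 and Indiv2, this gives a genotype (GT) of 0/1 for both individuals.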

BAM File Format

Standard BAM format.

FASTQ Data

RNAseq reads in FASTQ format.

Select Options for FastQ data processing

Select options

This step displays options specific to the type of data that you selected.

The most complicated options are for FASTQ input files. CoGe provides multiple downstream analyses in addition to mapping reads to the reference sequence.

Expression Analysis Pipeline

The Expression Analysis Pipeline was developed by James Schnable for his qTeller project.

SNP Identification Pipeline

Multiple methods of SNP identification are available to choose from:

Review and Submit

This final step simply displays a summary of the metadata, data, and options you've selected. To begin the load, click the "Start Loading" button.

Troubleshooting

Make sure that these basic requirements are followed:

  • The file format must be one of those listed above.
  • The chromosome names used in the data files must exactly match those for the genome. Mismatches will cause the load to fail.
  • For text files, such as .CSV and .VCF, the newline (EOL) characters must be Unix-compliant (LF character, not CRLF); see this article. This is sometimes a problem when loading text files created in Excel on Windows.
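
If a text file was saved with Windows line endings, one way to fix it before loading is to rewrite it with LF endings. The following is a minimal sketch; the function name `normalize_newlines` is hypothetical, and it reads the whole file into memory, so it suits modest file sizes.

```python
def normalize_newlines(src_path, dst_path):
    """Copy src_path to dst_path, rewriting CRLF (and lone CR) endings as LF."""
    with open(src_path, "rb") as src:
        data = src.read()
    # Replace CRLF first so that each pair becomes a single LF.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(dst_path, "wb") as dst:
        dst.write(data)
```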

Bulk Loading

Please contact the CoGe Team if you have large numbers of experiments you wish to load and we can help you with the bulk loading.

Demo Data

These data are smaller than "real" datasets, but allow you to learn and test the system quickly (or to use for demos and teaching).

  • TODO