Difference between revisions of "Expression Analysis Pipeline"

From CoGepedia
 
(10 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 +
[[File:Screen_Shot_2014-03-04_at_9.50.01_AM.png|thumb|300px|right|overview of various types of mapping of data from RNASeq]] [[File:Screen Shot 2014-04-07 at 2.27.54 PM.png|thumb|300px|Comparing the results of data mapped with GSNAP versus Tophat/Bowtie2]]
 +
 
CoGe can generate gene/transcript expression measurements given a FASTQ input and an annotated genome.  Thanks to [http://www.skraelingmountain.com/ James Schnable], creator of [http://qteller.com/ qTeller], for help developing this pipeline!
 
  
 +
==Workflow==
 
When a FASTQ file of sequence reads is loaded in [[LoadExperiment]] and associated with an annotated genome, the following analysis steps are performed:
 
 
# The FASTQ file is verified for correct format.
 
# [http://code.google.com/p/cutadapt/ CutAdapt] is run to trim adapter sequence from the reads (parameters:  -q 25 --quality-base=64  -m 17).
+
# [http://code.google.com/p/cutadapt/ CutAdapt] is run to trim adapter sequence from the reads (parameters:  -q 25 -m 17).
# [http://research-pub.gene.com/gmap/ GMAP] is run to index the reference genome sequence.
+
## -q: Trim low-quality ends from reads before adapter removal. The algorithm is the same as the one used by BWA (Subtract CUTOFF from all qualities; compute partial sums from all indices to the end of the sequence; cut sequence at the index at which the sum is minimal)
# [http://research-pub.gene.com/gmap/ GSNAP] is run to align the reads to the reference sequence (parameters: -n 5 --format=sam  -Q  --gmap-mode=none  --nofails).
+
##-m: Discard trimmed reads that are shorter than LENGTH. Reads that are too short even before adapter removal are also discarded. In colorspace, an initial primer is not counted (default: 0).  
# [http://samtools.sourceforge.net/ SAMtools] is run to compute per-position read depth of the resulting alignment (mpileup -D  -Q 20).
+
# [http://research-pub.gene.com/gmap/ GMAP] or [http://bowtie-bio.sourceforge.net/bowtie2/index.shtml Bowtie2] is run to index the reference genome sequence, depending on your choice.
 +
# [http://research-pub.gene.com/gmap/ GSNAP] or [http://tophat.cbcb.umd.edu/ TopHat] is run to align the reads to the reference sequence (GSNAP parameters: -N 1 -n 5 --format=sam  -Q  --gmap-mode=none  --nofails, TopHat parameters: -g 1).
 +
# [http://samtools.sourceforge.net/ SAMtools] is run to compute per-position read depth of the resulting alignment (depth -q 20).
 
# [http://cufflinks.cbcb.umd.edu/ Cufflinks] is run to compute per-transcript FPKM (parameters: -p 24).
 
# The per-position read depth and per-transcript FPKM values are log transformed and normalized between 0 and 1 for loading.
 
Line 14: Line 19:
 
TBD:  how to do this ...
 
  
[[File:Screen_Shot_2014-03-04_at_9.50.01_AM.png|thumb|300px]] [[File:Screen Shot 2014-04-07 at 2.27.54 PM.png|thumb|300px]]
 
  
===Video Tutorial===
+
==Video Tutorial==
 
{{#ev:youtube|3fNyHGB02dM}}
 
  
*Demo fastq data for Arabidopsis Col-0:  
+
 
 +
==Demo fastq data for Arabidopsis Col-0:==
 
** 0.17M reads: http://de.iplantcollaborative.org/dl/d/2F807292-34CC-4C8E-96E3-3E668A304D23/test_rna_seq_data_0.17M_reads.fastq
 
 
** 1M reads: http://de.iplantcollaborative.org/dl/d/EFD4F983-80B1-4388-94C4-AD78E73D2795/test_rna_seq_data_1M_reads.fastq
 
 
** 7.6M reads: http://de.iplantcollaborative.org/dl/d/9F6602D6-C66B-4C97-A72A-180AAE55AF95/test_rna_seq_data_7.6M_reads.fastq
 

Latest revision as of 15:40, 5 January 2016

Overview of various types of mapping of data from RNASeq
Comparing the results of data mapped with GSNAP versus TopHat/Bowtie2

CoGe can generate gene/transcript expression measurements given a FASTQ input and an annotated genome. Thanks to James Schnable, creator of qTeller, for help developing this pipeline!

Workflow

When a FASTQ file of sequence reads is loaded in LoadExperiment and associated with an annotated genome, the following analysis steps are performed:

  1. The FASTQ file is verified for correct format.
  2. CutAdapt is run to trim adapter sequence from the reads (parameters: -q 25 -m 17).
    1. -q: Trim low-quality ends from reads before adapter removal. The algorithm is the same as the one used by BWA (subtract CUTOFF from all qualities; compute partial sums from each index to the end of the sequence; cut the sequence at the index at which the sum is minimal).
    2. -m: Discard trimmed reads that are shorter than LENGTH. Reads that are too short even before adapter removal are also discarded. In colorspace, an initial primer is not counted (default: 0).
  3. GMAP or Bowtie2 is run to index the reference genome sequence, depending on your choice.
  4. GSNAP or TopHat is run to align the reads to the reference sequence (GSNAP parameters: -N 1 -n 5 --format=sam -Q --gmap-mode=none --nofails, TopHat parameters: -g 1).
  5. SAMtools is run to compute per-position read depth of the resulting alignment (depth -q 20).
  6. Cufflinks is run to compute per-transcript FPKM (parameters: -p 24).
  7. The per-position read depth and per-transcript FPKM values are log-transformed and normalized to the range 0 to 1 for loading.
  8. The three results (raw alignment, per-position read depth, and per-transcript FPKM) are loaded as separate Experiments into a Notebook.
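
Two of the steps above are simple enough to sketch in code: the quality trimming described for CutAdapt's -q and -m options (step 2) and the log transform and normalization (step 7). The Python below is only an illustrative sketch and is not taken from CoGe or CutAdapt. The function names are hypothetical, the trimming follows the wording of the description literally (the real tool may differ in details), and the log base, the pseudocount of 1, and the min-max scaling are assumptions, since the page does not state how the normalization is done.

```python
import math


def quality_trim_index(qualities, cutoff=25):
    """Return the index at which to cut the end of a read.

    Literal reading of the description: subtract the cutoff from every
    quality, sum from each index to the end of the read, and cut where
    that sum is smallest. The empty suffix (sum 0) is also a candidate,
    so a read without a low-quality tail is left untouched.
    """
    best_index = len(qualities)
    best_sum = 0
    running = 0
    # Walk from the last base towards the first, accumulating suffix sums.
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best_sum:
            best_sum = running
            best_index = i
    return best_index


def trim_read(sequence, qualities, cutoff=25, min_length=17):
    """Quality-trim a read; return None if it ends up shorter than min_length."""
    cut = quality_trim_index(qualities, cutoff)
    if cut < min_length:
        return None
    return sequence[:cut], qualities[:cut]


def log_normalize(values, pseudocount=1.0):
    """Log-transform values and rescale them to the range 0 to 1.

    Assumptions: log10(value + pseudocount) followed by min-max scaling.
    """
    if not values:
        return []
    logged = [math.log10(v + pseudocount) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # All values identical: nothing to scale, so report zeros.
        return [0.0] * len(logged)
    return [(x - lo) / (hi - lo) for x in logged]


if __name__ == "__main__":
    # A short read whose last two bases fall below the cutoff of 25.
    print(trim_read("ACGTA", [40, 40, 40, 10, 10], cutoff=25, min_length=3))
    # Depth or FPKM values mapped onto the 0 to 1 range.
    print(log_normalize([0, 9, 99]))
```

The pseudocount keeps positions or transcripts with a value of zero from producing an undefined logarithm; a different offset or scaling would change the loaded values but not their ordering.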

Genomes for which this analysis has been performed can have features imported into qTeller. TBD: how to do this ...


Video Tutorial

https://www.youtube.com/watch?v=3fNyHGB02dM


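Demo fastq data for Arabidopsis Col-0:

  * 0.17M reads: http://de.iplantcollaborative.org/dl/d/2F807292-34CC-4C8E-96E3-3E668A304D23/test_rna_seq_data_0.17M_reads.fastq
  * 1M reads: http://de.iplantcollaborative.org/dl/d/EFD4F983-80B1-4388-94C4-AD78E73D2795/test_rna_seq_data_1M_reads.fastq
  * 7.6M reads: http://de.iplantcollaborative.org/dl/d/9F6602D6-C66B-4C97-A72A-180AAE55AF95/test_rna_seq_data_7.6M_reads.fastq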