Difference between revisions of "Expression Analysis Pipeline"

Latest revision as of 15:40, 5 January 2016

Figure: Overview of various types of mapping of data from RNASeq

Figure: Comparing the results of data mapped with GSNAP versus Tophat/Bowtie2

CoGe can generate gene/transcript expression measurements given a FASTQ input and an annotated genome. Thanks to James Schnable, creator of qTeller, for help developing this pipeline!

Workflow

When a FASTQ file of sequence reads is loaded in LoadExperiment and associated with an annotated genome, the following analysis steps are performed:

1. The FASTQ file is verified for correct format.
2. CutAdapt is run to trim adapter sequence from the reads (parameters: -q 25 -m 17).
   1. -q: Trim low-quality ends from reads before adapter removal. The algorithm is the same as the one used by BWA: subtract CUTOFF from all qualities, compute partial sums from all indices to the end of the sequence, and cut the sequence at the index at which the sum is minimal.
   2. -m: Discard trimmed reads that are shorter than LENGTH. Reads that are too short even before adapter removal are also discarded. In colorspace, an initial primer is not counted (default: 0).
3. GMAP or Bowtie2 is run to index the reference genome sequence, depending on the aligner you choose.
4. GSNAP or TopHat is run to align the reads to the reference sequence (GSNAP parameters: -N 1 -n 5 --format=sam -Q --gmap-mode=none --nofails; TopHat parameters: -g 1).
5. SAMtools is run to compute per-position read depth of the resulting alignment (depth -q 20).
6. Cufflinks is run to compute per-transcript FPKM (parameters: -p 24).
7. The per-position read depth and per-transcript FPKM values are log-transformed and normalized between 0 and 1 for loading.
8. The three results (raw alignment, per-position read depth, and per-transcript FPKM) are loaded as separate Experiments into a Notebook.
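
Two of the steps above are described closely enough to sketch in code: the -q/-m read filtering and the final log transform and scaling to the 0 to 1 range. The Python below is a minimal illustration only, not the pipeline's or CutAdapt's actual code. The function names are hypothetical, adapter removal is left out, and this page does not say which logarithm, pseudocount, or scaling CoGe applies, so log10(x + 1) divided by the largest value is purely an assumption.

```python
import math


def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut a read; qualities[:index] is kept.

    Follows the description of -q: subtract the cutoff from every quality,
    form the partial sum from each index to the end of the read, and cut
    at the index where that sum is smallest. If no partial sum is negative,
    nothing is trimmed.
    """
    best_index = len(qualities)
    best_sum = 0
    running = 0
    # Walk from the 3' end so `running` is the partial sum from i to the end.
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best_sum:
            best_sum = running
            best_index = i
    return best_index


def trim_read(sequence, qualities, cutoff=25, min_length=17):
    """Quality-trim one read, then apply the -m length filter.

    Returns (sequence, qualities) after trimming, or None when the
    trimmed read is shorter than min_length. Adapter removal is omitted.
    """
    cut = quality_trim_index(qualities, cutoff)
    if cut < min_length:
        return None
    return sequence[:cut], qualities[:cut]


def log_normalize(values):
    """Log-transform values and scale them into the range 0 to 1.

    ASSUMPTION: uses log10(x + 1) and divides by the largest transformed
    value; the wiki page does not specify the exact transform.
    """
    logged = [math.log10(v + 1.0) for v in values]
    top = max(logged, default=0.0)
    if top == 0.0:
        return [0.0 for _ in logged]
    return [x / top for x in logged]
```

For example, with a cutoff of 25 the qualities [30, 30, 30, 10, 10] are cut at index 3, so the first three bases are kept. With the default minimum length of 17 that read would then be discarded.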

Genomes for which this analysis has been performed can have features imported into qTeller. TBD: how to do this ...

@@ Line 1: / Line 1: @@
+[[File:Screen_Shot_2014-03-04_at_9.50.01_AM.png|thumb|300px|right|overview of various types of mapping of data from RNASeq]] [[File:Screen Shot 2014-04-07 at 2.27.54 PM.png|thumb|300px|Comparing the results of data mapped with GSNAP versus Tophat/Bowtie2]]
 CoGe can generate gene/transcript expression measurements given a FASTQ input and an annotated genome.  Thanks to [http://www.skraelingmountain.com/ James Schnable], creator of [http://qteller.com/ qTeller], for help developing this pipeline!
+==Workflow==
 When a FASTQ file of sequence reads is loaded in [[LoadExperiment]] and associated with an annotated genome, the following analysis steps are performed:
 # The FASTQ file is verified for correct format.
-# [http://code.google.com/p/cutadapt/ CutAdapt] is run to trim adapter sequence from the reads (parameters:  -q 25 --quality-base=64  -m 17).
+# [http://code.google.com/p/cutadapt/ CutAdapt] is run to trim adapter sequence from the reads (parameters:  -q 25 -m 17).
-# [http://research-pub.gene.com/gmap/ GMAP] is run to index the reference genome sequence.
+## -q: Trim low-quality ends from reads before adapter removal. The algorithm is the same as the one used by BWA (Subtract CUTOFF from all qualities; compute partial sums from all indices to the end of the sequence; cut sequence at the index at which the sum is minimal)
-# [http://research-pub.gene.com/gmap/ GSNAP] is run to align the reads to the reference sequence (parameters: -n 5 --format=sam  -Q  --gmap-mode=none  --nofails).
+##-m: Discard trimmed reads that are shorter than LENGTH. Reads that are too short even before adapter removal are also discarded. In colorspace, an initial primer is not counted (default: 0).
-# [http://samtools.sourceforge.net/ SAMtools] is run to compute per-position read depth of the resulting alignment (mpileup -D  -Q 20).
+# [http://research-pub.gene.com/gmap/ GMAP] or [http://bowtie-bio.sourceforge.net/bowtie2/index.shtml Bowtie2] is run to index the reference genome sequence, depending on your choice.
+# [http://research-pub.gene.com/gmap/ GSNAP] or [http://tophat.cbcb.umd.edu/ TopHat] is run to align the reads to the reference sequence (GSNAP parameters: -N 1 -n 5 --format=sam  -Q  --gmap-mode=none  --nofails, TopHat parameters: -g 1).
+# [http://samtools.sourceforge.net/ SAMtools] is run to compute per-position read depth of the resulting alignment (depth -q 20).
 # [http://cufflinks.cbcb.umd.edu/ Cufflinks] is run to compute per-transcript FPKM (parameters: -p 24).
 # The per-position read depth and per-transcript FPKM values are log transformed and normalized between 0 and 1 for loading.
@@ Line 14: / Line 19: @@
 TBD:  how to do this ...
-[[File:Screen_Shot_2014-03-04_at_9.50.01_AM.png|thumb|300px]] [[File:Screen Shot 2014-04-07 at 2.27.54 PM.png|thumb|300px]]
-===Video Tutorial===
+==Video Tutorial==
 {{#ev:youtube|3fNyHGB02dM}}
-*Demo fastq data for Arabidopsis Col-0:
+==Demo fastq data for Arabidopsis Col-0:==
 ** 0.17M reads: http://de.iplantcollaborative.org/dl/d/2F807292-34CC-4C8E-96E3-3E668A304D23/test_rna_seq_data_0.17M_reads.fastq
 ** 1M reads: http://de.iplantcollaborative.org/dl/d/EFD4F983-80B1-4388-94C4-AD78E73D2795/test_rna_seq_data_1M_reads.fastq
 ** 7.6M reads: http://de.iplantcollaborative.org/dl/d/9F6602D6-C66B-4C97-A72A-180AAE55AF95/test_rna_seq_data_7.6M_reads.fastq
