Expression Analysis Pipeline

From CoGepedia
[Figure: Overview of the various types of mapping of RNA-Seq data]
[Figure: Comparison of results for data mapped with GSNAP versus TopHat/Bowtie2]

CoGe can generate gene/transcript expression measurements given a FASTQ input and an annotated genome. Thanks to James Schnable, creator of qTeller, for help developing this pipeline!


When a FASTQ file of sequence reads is loaded in LoadExperiment and associated with an annotated genome, the following analysis steps are performed:

  1. The FASTQ file is verified for correct format.
  2. CutAdapt is run to trim adapter sequence from the reads (parameters: -q 25 -m 17).
    1. -q: Trim low-quality ends from reads before adapter removal, using a quality cutoff (25 here). The algorithm is the same as the one used by BWA: subtract the cutoff from all qualities, compute the partial sum from each index to the end of the sequence, and cut the sequence at the index where that sum is minimal.
    2. -m: Discard trimmed reads that are shorter than the given length (17 here). Reads that are too short even before adapter removal are also discarded. In colorspace, an initial primer is not counted (default: 0).
  3. Depending on the aligner you choose, GMAP or Bowtie2 is run to index the reference genome sequence.
  4. GSNAP or TopHat is run to align the reads to the reference sequence (GSNAP parameters: -N 1 -n 5 --format=sam -Q --gmap-mode=none --nofails; TopHat parameters: -g 1).
  5. SAMtools is run to compute per-position read depth of the resulting alignment (depth -q 20).
  6. Cufflinks is run to compute per-transcript FPKM (parameters: -p 24).
  7. The per-position read depth and per-transcript FPKM values are log-transformed and normalized to the range 0 to 1 for loading.
  8. The three results (raw alignment, per-position read depth, and per-transcript FPKM) are loaded as separate Experiments into a Notebook.
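
The quality trimming in step 2 and the normalization in step 7 can be illustrated with a minimal Python sketch. It is not the pipeline's code: the function names are hypothetical, the trimming follows only the description given in step 2 (the real tool may differ in details), and the normalization assumes log10(v + 1) followed by division by the maximum, since the text does not specify the exact transform.

```python
import math


def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut the 3' end of a read.

    Follows the description in step 2: subtract the cutoff from every
    quality, sum from each index to the end of the read, and cut where
    that sum is smallest. A read with no negative sum is kept whole.
    """
    best_sum = 0                    # sum of the empty suffix (no trimming)
    best_index = len(qualities)
    running = 0
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best_sum:
            best_sum = running
            best_index = i
    return best_index


def trim_read(sequence, qualities, cutoff=25, min_length=17):
    """Quality-trim a read; return None if it ends up shorter than min_length."""
    cut = quality_trim_index(qualities, cutoff)
    trimmed = sequence[:cut]
    return trimmed if len(trimmed) >= min_length else None


def log_normalize(values):
    """Log-transform non-negative values and scale them into [0, 1].

    ASSUMPTION: log10(v + 1) divided by the maximum; the pipeline's
    actual transform is not specified in the text.
    """
    logged = [math.log10(v + 1.0) for v in values]
    top = max(logged, default=0.0)
    if top == 0.0:
        return [0.0] * len(logged)
    return [x / top for x in logged]


if __name__ == "__main__":
    print(quality_trim_index([40, 40, 30, 20, 10], cutoff=25))
    print(log_normalize([0, 9, 99]))
```

For qualities 40, 40, 30, 20, 10 and a cutoff of 25, the differences are 15, 15, 5, -5, -15 and the sums to the end of the read are 15, 0, -15, -20, -15. The minimum falls at index 3, so only the first three bases are kept. With the default minimum length of 17, such a short read would then be discarded.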

Features from genomes for which this analysis has been performed can be imported into qTeller. TBD: how to do this ...

Video Tutorial

Demo FASTQ data for Arabidopsis Col-0: