Processing RNA seq data

From CoGepedia
Jump to: navigation, search

CoGe's RNA-seq processing pipeline is highly automated version of the qTeller RNA-seq processing pipeline.

This page shows the command line tools and parameters used by CoGe: Expression Analysis Pipeline

For detailed instructions on how to use these same tools manually see this PDF. Automated python scripts for batch processing large numbers of datasets are available here.

Quality Trimming

When using RNA-seq data to quantify expression (rather than calling SNPs or transcriptome assembly), quality trimming should be regarded as an optional step. However, it can substantially reduce total run-time by avoiding requiring aligners to spend significant amounts of processor time searching for valid alignment sites for low quality reads.

The version of the qTeller pipeline employed by CoGe uses "cutadapt" for quality trimming. The default settings for running cutadapt within the pipeline treats as low quality positions in a read with quality scores less than 25 (those with an approximately 1 in 300 chance of being a mis-call). Reads which are less than 17 base pairs long after quality trimming are discarded as that is the minimum length required by the software used to perform alignments.


The processed reads generated by quality trimming are then aligned against a reference genome selected from those loaded into CoGe. Alignments are performed using GSNAP, a program which enables the efficient and accurate spliced alignment of reads (necessary when aligning reads originating from mRNA against a genomic reference sequence). While several other tools are available which satisfied both criteria (notably RUM, and tophat), we have found that GSNAP provides the best mix of extremely high accuracy with practical run times. The default settings used for loading data in CoGe discard reads with >5 equally good "best" map locations as well as reads where the best alignment contains five or more mismatches.

Format Conversion and Sorting

GSNAP outputs alignments in SAM (Sequence Alignment/Mapping) format which is (to the trained eye) human readable, but not very efficient in terms of either storage space or random access. The pipeline converts the initial SAM output into a sorted BAM file using SAMtools.

Expression Quantification

Right now FPKM (frequency per kilobase of exon per million reads) values are calculated using [cufflinks] and GFF files produced by CoGe from a set of gene model annotations selected by the user. This choice was made since cufflinks is widely used and the FPKM values calculated using this software will tend to make the most intuitive sense to people used to working with RNA-seq datasets, even though various versions of cufflinks have produced different weird bugs. If you have a suggestion for an alternative software package, particularly with a case for why it is likely to be widely used and familiar to the community in the future, contact James Schnable.