Processing RNA seq data

From CoGepedia
Revision as of 12:19, 12 February 2014 by Jschnable (Talk | contribs) (Alignment)

Jump to: navigation, search

CoGe's RNA-seq processing pipeline is highly automated version of the qTeller RNA-seq processing pipeline.

For detailed instructions on how to use these same tools manually see this PDF. Automated python scripts for batch processing large numbers of datasets are available here.

Quality Trimming

When using RNA-seq data to quantify expression (rather than calling SNPs or transcriptome assembly), quality trimming should be regarded as an optional step. However, it can substantially reduce total run-time by avoiding requiring aligners to spend significant amounts of processor time searching for valid alignment sites for low quality reads.

The version of the qTeller pipeline employed by CoGe uses "cutadapt" for quality trimming. The default settings for running cutadapt within the pipeline treats as low quality positions in a read with quality scores less than 25 (those with an approximately 1 in 300 chance of being a mis-call). Reads which are less than 17 base pairs long after quality trimming are discarded as that is the minimum length required by the software used to perform alignments.

Alignment

The processed reads generated by quality trimming are then aligned against a reference genome selected from those loaded into CoGe. Alignments are performed using GSNAP, a program which enables the efficient and accurate spliced alignment of reads (necessary when aligning reads originating from mRNA against a genomic reference sequence). While several other tools are available which satisfied both criteria (notably RUM, and tophat), we have found that GSNAP provides the best mix of extremely high accuracy with practical run times. The default settings used for loading data in CoGe discard reads with >5 equally good "best" map locations as well as reads where the best alignment contains five or more mismatches.

Format Conversion and Sorting

Expression Quantification