Difference between revisions of "Processing RNA seq data"

From CoGepedia
Jump to: navigation, search
(Alignment)
(Format Conversion and Sorting)
Line 13: Line 13:
  
 
==Format Conversion and Sorting==
 
==Format Conversion and Sorting==
 +
 +
GSNAP outputs alignments in SAM (Sequence Alignment/Mapping) format which is (to the trained eye) human readable, but not very efficient in terms of either storage space or random access. The pipeline converts the initial SAM output into a sorted BAM file using [http://samtools.sourceforge.net/ SAMtools].
  
 
==Expression Quantification==
 
==Expression Quantification==

Revision as of 12:30, 12 February 2014

CoGe's RNA-seq processing pipeline is highly automated version of the qTeller RNA-seq processing pipeline.

For detailed instructions on how to use these same tools manually see this PDF. Automated python scripts for batch processing large numbers of datasets are available here.

Quality Trimming

When using RNA-seq data to quantify expression (rather than calling SNPs or transcriptome assembly), quality trimming should be regarded as an optional step. However, it can substantially reduce total run-time by avoiding requiring aligners to spend significant amounts of processor time searching for valid alignment sites for low quality reads.

The version of the qTeller pipeline employed by CoGe uses "cutadapt" for quality trimming. The default settings for running cutadapt within the pipeline treats as low quality positions in a read with quality scores less than 25 (those with an approximately 1 in 300 chance of being a mis-call). Reads which are less than 17 base pairs long after quality trimming are discarded as that is the minimum length required by the software used to perform alignments.

Alignment

The processed reads generated by quality trimming are then aligned against a reference genome selected from those loaded into CoGe. Alignments are performed using GSNAP, a program which enables the efficient and accurate spliced alignment of reads (necessary when aligning reads originating from mRNA against a genomic reference sequence). While several other tools are available which satisfied both criteria (notably RUM, and tophat), we have found that GSNAP provides the best mix of extremely high accuracy with practical run times. The default settings used for loading data in CoGe discard reads with >5 equally good "best" map locations as well as reads where the best alignment contains five or more mismatches.

Format Conversion and Sorting

GSNAP outputs alignments in SAM (Sequence Alignment/Mapping) format which is (to the trained eye) human readable, but not very efficient in terms of either storage space or random access. The pipeline converts the initial SAM output into a sorted BAM file using SAMtools.

Expression Quantification