Methylation Analysis Pipeline

From CoGepedia
Revision as of 16:17, 25 February 2016 by Mbomhoff (Talk | contribs)

Example of percent methylation data tracks displayed against other genomic features

CoGe can analyze bisulfite sequencing data and visualize percent methylation at single-cytosine resolution. Additionally, output filetypes can be easily converted to appropriate formats for automated differential methylation analysis. This analysis pipeline is a work in progress being developed by Jeffrey Grover and Matt Bomhoff at the University of Arizona.

See the LoadExperiment tool to use the new pipeline. Select BAM/FASTQ input files to make the pipeline available.

Two separate analysis pipelines are currently available: one based on Bismark and the other on bwameth.

What Is This For?

The tools in this pipeline are intended for use with whole genome bisulfite sequencing (WGBS). In theory, many of them are also appropriate for reduced representation bisulfite sequencing (RRBS), but this is untested. Reads must be in fastq format, single- or paired-end, with quality scores encoded on either the phred 33 or the phred 64 scale. An appropriate reference genome to align against must also be available.
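Because only phred 33 and phred 64 quality encodings are accepted, it can help to check a file before loading it. The following is a minimal Python sketch of a common heuristic and is not part of CoGe. The function name `guess_phred_offset` and the thresholds are assumptions made for illustration: characters below ASCII 59 imply offset 33, while a minimum of at least 64 together with characters above ASCII 74 suggests offset 64.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality offset (33 or 64) from FASTQ quality strings.

    Returns 33, 64, or None when the observed characters are ambiguous.
    The thresholds are heuristic assumptions, not a CoGe rule.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters this low cannot occur with an offset of 64.
        return 33
    if lowest >= 64 and highest > 74:
        # Nothing below '@' and some characters above 'J': likely offset 64.
        return 64
    return None  # Not enough evidence either way.


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#", "5555"]))  # low characters present
    print(guess_phred_offset(["hhhhfffe"]))       # only high characters
```

Only the quality line of each record (every fourth line of the file) should be passed in. A result of `None` means more reads need to be sampled or the encoding must be confirmed from the sequencing provider.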

What Is This Not For?

This pipeline is not intended for any other sequencing-based methylation analysis, including but not limited to MeDIP-seq (methylated DNA immunoprecipitation sequencing), nor for array-based methylation analyses. It also does not support fastq files with quality encoding other than phred 33/64 (Illumina format) or, more generally, anything not mentioned in the preceding section. In some cases, non-Illumina-formatted reads can be converted to the appropriate formats.

Note:

It is common for fastq files, especially ones downloaded from certain public sources or opened by users, to have formatting abnormalities. If the pipeline fails, please check for common formatting problems, such as extra newlines or special characters. If sequencing reads are spread across multiple files, they should be concatenated before loading. Paired-end file names should end with the identifiers "R1" and "R2", respectively (e.g., sample1_R1.fastq and sample1_R2.fastq).
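The formatting problems mentioned above can be screened for with a short script. This is a hedged Python sketch, not a CoGe tool: the function name `check_fastq` and the specific checks (blank lines, four-line records, `@` and `+` markers, matching sequence and quality lengths, bases limited to A, C, G, T, and N) are illustrative assumptions about what counts as well-formed.

```python
def check_fastq(lines):
    """Return a list of (line_number, message) problems in FASTQ text lines."""
    rows = [ln.rstrip("\r\n") for ln in lines]

    # Blank lines break the four-line record framing, so report them first.
    blank = [i for i, row in enumerate(rows, start=1) if not row.strip()]
    if blank:
        return [(i, "blank line") for i in blank]

    problems = []
    if len(rows) % 4 != 0:
        problems.append((len(rows), "line count is not a multiple of 4"))

    # Check each complete four-line record.
    for start in range(0, len(rows) - len(rows) % 4, 4):
        header, seq, plus, qual = rows[start:start + 4]
        if not header.startswith("@"):
            problems.append((start + 1, "header does not start with '@'"))
        if set(seq.upper()) - set("ACGTN"):
            problems.append((start + 2, "unexpected character in sequence"))
        if not plus.startswith("+"):
            problems.append((start + 3, "separator does not start with '+'"))
        if len(seq) != len(qual):
            problems.append((start + 4, "sequence and quality lengths differ"))
    return problems


if __name__ == "__main__":
    sample = ["@read1\n", "ACGT\n", "+\n", "IIII\n", "\n"]
    for line_number, message in check_fastq(sample):
        print(line_number, message)
```

For a real file, the lines can be supplied with `check_fastq(open(path))`. Very large files are better checked on a sample of reads, since this sketch holds all lines in memory.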

Workflow

Fastq files can be loaded either from a local machine or from the iPlant datastore (recommended).

  1. Reads should be trimmed if this hasn't been done already. Trim Galore! is a wrapper for FastQC and cutadapt, available in CoGe, that performs adapter, quality-based, and length-based trimming. It will attempt to detect the adapter type automatically from the input file, but this can be overridden with a nucleotide string if desired.
  2. The reference genome is indexed and in-silico bisulfite converted with either bismark_genome_preparation or bwameth's index functionality.
  3. Depending on the selected pipeline, either Bismark (using Bowtie2) or bwameth is used to align the bisulfite-converted reads.
  4. Either bismark_deduplicate or Picard tools removes PCR duplication artifacts.
  5. Methylation status is extracted with either bismark_methylation_extractor or PileOMeth.
  6. Methylation summaries are then reformatted to .csv to comply with quantitative data loading in CoGe and are filtered by the desired minimum read depth (default: 5). The result is loaded into the genome browser as a viewable quantitative track. Top-strand methylation is represented as positive numbers (0% to 100%, expressed as a decimal) and bottom-strand methylation as negative numbers (same scale). When mousing over these quantitative tracks, the decimal-formatted methylation percentage and the read depth for that position are displayed, in that order.
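The depth filter and strand sign convention in the last step can be sketched as follows. This is an illustrative Python example, not the pipeline's actual code: the function names, the input tuple layout (chromosome, position, strand, methylated count, unmethylated count), and the output column order are all assumptions. The exact .csv layout that CoGe expects is not specified here.

```python
import csv
import io


def to_signed_rows(calls, min_depth=5):
    """Filter per-cytosine calls by depth and sign the methylation fraction.

    calls: iterable of (chrom, pos, strand, methylated, unmethylated).
    Returns rows of (chrom, pos, strand, signed_fraction, depth).
    """
    rows = []
    for chrom, pos, strand, meth, unmeth in calls:
        depth = meth + unmeth
        if depth < min_depth:
            continue  # Too few reads to report this position.
        fraction = meth / depth
        # Top strand stays positive; bottom strand is negated.
        value = fraction if strand == "+" else -fraction
        rows.append((chrom, pos, strand, round(value, 4), depth))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text (hypothetical column order)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    calls = [
        ("chr1", 100, "+", 8, 2),  # depth 10, 80% methylated, top strand
        ("chr1", 101, "-", 3, 7),  # depth 10, 30% methylated, bottom strand
        ("chr1", 102, "+", 1, 1),  # depth 2, dropped by the filter
    ]
    print(rows_to_csv(to_signed_rows(calls)), end="")
```

Keeping the depth next to the signed value mirrors the mouse-over display described above, where methylation and read depth are shown together.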

Options

More information is available in the individual tools' documentation. This is a summary of the options visible to users through CoGe.

Trim Galore!

  • -q: trim low quality (phred score) nucleotide calls from end of reads prior to adapter removal (default: 20)
  • --length: remove reads shorter than indicated length after all other trimming is complete (default: 20)
  • -a: override automatic adapter sequence detection with a nucleotide string (default: off)
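For readers who want to reproduce the trimming step outside CoGe, the options above map onto a Trim Galore! command line roughly as in this Python sketch. The helper name `trim_galore_command` is hypothetical, and the exact way CoGe invokes the tool is not documented here. The `--paired` flag comes from Trim Galore!'s own documentation rather than from this page.

```python
def trim_galore_command(fastq_files, quality=20, min_length=20, adapter=None):
    """Build a trim_galore argument list from the options exposed in CoGe.

    fastq_files: one file (single-end) or two files (paired-end).
    """
    if len(fastq_files) not in (1, 2):
        raise ValueError("expected one or two fastq files")
    cmd = ["trim_galore", "-q", str(quality), "--length", str(min_length)]
    if adapter:
        # Override automatic adapter detection with an explicit sequence.
        cmd += ["-a", adapter]
    if len(fastq_files) == 2:
        # Paired-end mode keeps read pairs in sync while trimming.
        cmd.append("--paired")
    return cmd + list(fastq_files)


if __name__ == "__main__":
    print(" ".join(trim_galore_command(["sample1_R1.fastq", "sample1_R2.fastq"])))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Trim Galore! is installed.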

Bismark

  • -N: number of mismatches allowed in Bowtie2 alignment seed region (default: 0, only 0 or 1 allowed)
  • -L: seed length for Bowtie2 alignment (default: 20)

bismark_methylation_extractor

  • --ignore: ignore bisulfite mismatches within this many nucleotides of the 5' end of a read (or of read 1 of paired-end data) (default: 0)
  • --ignore_3prime: ignore bisulfite mismatches within this many nucleotides of the 3' end of a read (or of read 1 of paired-end data) (default: 0)
  • --ignore_r2: ignore bisulfite mismatches within this many nucleotides of the 5' end of read 2 of paired-end data (default: 0)
  • --ignore_3prime_r2: ignore bisulfite mismatches within this many nucleotides of the 3' end of read 2 of paired-end data (default: 0)

These options are useful for removing methylation bias at the ends of reads, if necessary.
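The effect of the ignore options on a single read can be pictured with a small Python sketch. This is a conceptual illustration only, not Bismark's implementation. The function name `kept_positions` and the use of 0-based read indexes are assumptions.

```python
def kept_positions(read_length, ignore_5prime=0, ignore_3prime=0):
    """Return the 0-based read indexes still used for methylation calls.

    ignore_5prime positions are dropped from the start of the read and
    ignore_3prime positions from its end.
    """
    return list(range(ignore_5prime, read_length - ignore_3prime))


if __name__ == "__main__":
    # A 10 bp read with 2 bases ignored at the 5' end and 3 at the 3' end.
    print(kept_positions(10, ignore_5prime=2, ignore_3prime=3))
```

For paired-end data the same calculation would be applied separately to read 1 (--ignore, --ignore_3prime) and read 2 (--ignore_r2, --ignore_3prime_r2).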

PileOMeth

  • --OT: inclusion regions for reads corresponding to the original top strand. Format A,B,C,D means include calls from position A to B on read 1 and C to D on read 2. (default: 0,0,0,0 - whole read)
  • --OB: inclusion regions for reads corresponding to the original bottom strand. Format A,B,C,D means include calls from position A to B on read 1 and C to D on read 2. (default: 0,0,0,0 - whole read)

These options are useful for removing methylation bias at the ends of reads, if necessary.
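The A,B,C,D format can be interpreted as in the following Python sketch. It is an illustration rather than PileOMeth's code. The function names are hypothetical, and the sketch assumes that positions are 1-based and that a 0 means "no bound on that side", which matches the default of 0,0,0,0 covering the whole read.

```python
def parse_bounds(spec):
    """Split 'A,B,C,D' into ((A, B), (C, D)) for read 1 and read 2."""
    a, b, c, d = (int(field) for field in spec.split(","))
    return (a, b), (c, d)


def included(position, bounds):
    """Return True if a 1-based read position lies inside the bounds.

    A bound of 0 is treated as unbounded on that side (assumption).
    """
    first, last = bounds
    if first and position < first:
        return False
    if last and position > last:
        return False
    return True


if __name__ == "__main__":
    read1, read2 = parse_bounds("5,0,0,90")
    print(included(4, read1), included(5, read1))    # read 1 starts at position 5
    print(included(90, read2), included(91, read2))  # read 2 stops at position 90
```

With these assumptions, --OT 5,0,0,90 would skip the first four positions of read 1 and everything past position 90 of read 2 for original-top-strand reads.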

Deduplicate: Run bismark_deduplicate (Bismark pipeline) or Picard tools (bwameth pipeline) to remove PCR duplicates. (default: off, but this should be enabled in most cases)

Minimum Coverage: Minimum read depth required to report and visualize a methylation percentage. (default: 5, which should be a reasonable value for most applications, but the choice is left to the user)

Output Files

In addition to direct visualization, files are downloadable to enable downstream analyses (differentially methylated region detection, etc.). The exact format is still to be determined.