Methylation Analysis Pipeline

From CoGepedia

Summary

Example of percent methylation data tracks displayed against other genomic features

CoGe can analyze bisulfite sequencing data and visualize percent methylation at single-cytosine resolution. Additionally, the output files can easily be converted to formats suitable for automated differential methylation analysis. This analysis pipeline was developed by Jeffrey Grover at the University of Arizona. The metaplot analysis program was developed by Josquin Daron and Keith Slotkin at Ohio State University.

See the LoadExperiment tool to use the new pipeline. Select BAM/FASTQ input files to make the pipeline available.

Two separate analysis pipelines are currently available: one based on Bismark and the other on bwameth.

What Is This For?

The tools in this pipeline are intended for use with whole genome bisulfite sequencing (WGBS). In theory many of them are also appropriate for reduced representation bisulfite sequencing (RRBS), but this is untested. Reads must be in FASTQ format, with quality scores encoded on either the phred 33 or the phred 64 scale, and may be single-end or paired-end. An appropriate genome to align against must also be available.
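
If it is unclear which quality encoding a FASTQ file uses, the quality lines can be inspected before loading. The following is a minimal sketch of a common heuristic, not part of the CoGe pipeline; the function name and the cutoff values (59 and 74) are assumptions chosen for illustration, and short or unusual samples can remain ambiguous.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality offset (33 or 64) from FASTQ quality strings.

    Heuristic only (cutoffs are assumptions for illustration):
    - characters below ASCII 59 should not occur in phred 64 data,
      so seeing one suggests phred 33;
    - characters above ASCII 74 are unusual in phred 33 data,
      so seeing one suggests phred 64.
    Returns None when the sample is empty or ambiguous.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None
```

A return value of None means more reads should be sampled before drawing a conclusion.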

What Is This Not For?

This pipeline is not intended for any other sequencing-based methylation analysis, including but not limited to MeDIP-seq (methylated DNA immunoprecipitation sequencing), nor for array-based methylation analyses. It also does not support FASTQ files whose quality encoding is not phred 33 or phred 64 (Illumina format) or, more generally, anything not mentioned in the preceding section. In some cases, reads that are not in Illumina format can be converted to an appropriate format.

Note:

It is common for FASTQ files, especially ones downloaded from certain public sources or opened by users, to have formatting abnormalities. If the pipeline fails, please check for common formatting problems such as extra newlines or special characters. If sequencing reads are spread across multiple files and are intended to be aligned together, the files should be concatenated before loading (the Unix/Linux command "cat" will work).

The names of paired-end files must end, immediately before the file extension, with the identifiers "1" and "2", respectively (e.g., sample1_1.fastq and sample1_2.fastq, or sample1_R1.fastq and sample1_R2.fastq).
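
The checks described in this note can be automated before loading. The sketch below is illustrative only and is not the validation CoGe itself performs; the function names are hypothetical, and accepting .fq and .gz extensions in the file name pattern is an assumption.

```python
import re

# Assumed naming pattern: <stem><1 or 2>.fastq, optionally .fq and/or .gz
PAIR_RE = re.compile(r"^(?P<stem>.+?)(?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$")


def check_fastq(lines):
    """Return a list of formatting problems found in FASTQ text lines."""
    problems = []
    lines = [ln.rstrip("\r\n") for ln in lines]
    if len(lines) % 4 != 0:
        problems.append("line count is not a multiple of 4")
    # Examine each complete 4-line record: header, sequence, '+', quality.
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        rec = i // 4 + 1
        if not header.startswith("@"):
            problems.append(f"record {rec}: header does not start with '@'")
        if not plus.startswith("+"):
            problems.append(f"record {rec}: separator does not start with '+'")
        if len(seq) != len(qual):
            problems.append(f"record {rec}: sequence/quality length mismatch")
    return problems


def is_valid_pair(name1, name2):
    """True if the names share a stem and end in 1 and 2, in that order."""
    m1, m2 = PAIR_RE.match(name1), PAIR_RE.match(name2)
    return bool(
        m1 and m2
        and m1["stem"] == m2["stem"]
        and m1["mate"] == "1"
        and m2["mate"] == "2"
    )
```

A trailing empty line, one of the abnormalities mentioned above, shows up here as a line count that is not a multiple of 4.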

Workflow

FASTQ files can either be loaded from a local machine or the CyVerse datastore (recommended).

  1. Reads should be trimmed if this has not been done already. Trim Galore!, a wrapper for FastQC and Cutadapt that is available in CoGe, performs adapter, quality-based, and length-based trimming. It attempts to detect the adapter type automatically from the input file, but this can be overridden with a nucleotide string if desired.
  2. The reference genome is indexed and in-silico bisulfite converted with either bismark_genome_preparation or bwameth's index functionality.
  3. Depending on the selected pipeline, either Bismark (using Bowtie2) or bwameth is used to align the bisulfite-converted reads.
  4. If deduplication is enabled, PCR duplication artifacts are removed with either bismark_deduplicate or Picard tools.
  5. Methylation status is extracted with either bismark_methylation_extractor or PileOMeth.
  6. Methylation summaries are then reformatted to .csv to comply with CoGe's quantitative data loading format and are filtered by the desired minimum read depth (default: 5). The result is loaded into the genome browser as a viewable quantitative track. Top strand methylation is represented as positive numbers (0 to 1, i.e., 0% to 100% expressed as a decimal) and bottom strand methylation as negative numbers on the same scale. Mousing over these quantitative tracks displays the decimal-formatted methylation percentage and the read depth for that position, in that order.
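
The depth filtering and strand sign convention in step 6 can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the function name, the input tuple layout (chromosome, position, strand, methylated count, unmethylated count), and the rounding to four decimals are all hypothetical.

```python
def to_signed_track(calls, min_depth=5):
    """Filter per-cytosine calls by depth and sign them by strand.

    calls: iterable of (chrom, pos, strand, methylated, unmethylated),
           where strand is "+" (top) or "-" (bottom). Hypothetical layout.
    Returns a list of (chrom, pos, signed_fraction, depth) tuples.
    """
    rows = []
    for chrom, pos, strand, meth, unmeth in calls:
        depth = meth + unmeth
        if depth < min_depth:
            continue  # below the minimum coverage: not reported
        fraction = meth / depth  # 0 to 1
        # Top strand stays positive, bottom strand becomes negative.
        signed = fraction if strand == "+" else -fraction
        rows.append((chrom, pos, round(signed, 4), depth))
    return rows
```

For example, a bottom-strand cytosine with 4 methylated and 4 unmethylated reads would appear in the track as -0.5 with a depth of 8.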

Options

More information is available in the individual tools' documentation. This is a summary of the options visible to users through CoGe.

Trim Galore!

  • -q: trim low quality (phred score) nucleotide calls from end of reads prior to adapter removal (default: 20)
  • --length: remove reads shorter than indicated length after all other trimming is complete (default: 20)
  • -a: override automatic adapter sequence detection with a nucleotide string (default: off)

Bismark

  • -N: number of mismatches allowed in Bowtie2 alignment seed region (default: 0, only 0 or 1 allowed, setting this to 1 will AT LEAST double the run time)
  • -L: seed length for Bowtie2 alignment (default: 20)

bismark_methylation_extractor

  • --ignore: ignore methylation calls within this many nucleotides of the 5' end of a read (or of read 1 of paired-end data) (default: 0)
  • --ignore_3prime: ignore methylation calls within this many nucleotides of the 3' end of a read (or of read 1 of paired-end data) (default: 0)
  • --ignore_r2: ignore methylation calls within this many nucleotides of the 5' end of read 2 of paired-end data (default: 0)
  • --ignore_3prime_r2: ignore methylation calls within this many nucleotides of the 3' end of read 2 of paired-end data (default: 0)

These options are useful to remove methylation bias at the end of reads if necessary.

PileOMeth

  • --OT: inclusion regions for reads corresponding to the original top strand. Format A,B,C,D means include calls from position A to B on read 1 and C to D on read 2. (default: 0,0,0,0 - whole read)
  • --OB: inclusion regions for reads corresponding to the original bottom strand. Format A,B,C,D means include calls from position A to B on read 1 and C to D on read 2. (default: 0,0,0,0 - whole read)

These options are useful to remove methylation bias at the end of reads if necessary.
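
To make the A,B,C,D format concrete, the sketch below parses such a string and tests whether a read position falls inside the inclusion region. It is not PileOMeth code; the function names are hypothetical, positions are assumed to be 1-based, and a 0 is treated as "no bound on that side", consistent with the default of 0,0,0,0 meaning the whole read.

```python
def parse_bounds(value):
    """Parse an 'A,B,C,D' string into per-read (start, end) bounds."""
    parts = [int(p) for p in value.split(",")]
    if len(parts) != 4 or any(p < 0 for p in parts):
        raise ValueError("expected four non-negative integers: A,B,C,D")
    return {"read1": (parts[0], parts[1]), "read2": (parts[2], parts[3])}


def position_included(bounds, read, pos):
    """True if 1-based position `pos` on `read` is inside the bounds.

    A bound of 0 is assumed to mean 'unbounded' on that side.
    """
    start, end = bounds[read]
    after_start = start == 0 or pos >= start
    before_end = end == 0 or pos <= end
    return after_start and before_end
```

With a value of 5,90,0,0, for instance, calls at positions 1 to 4 and beyond 90 on read 1 would be excluded, while all of read 2 would be kept.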

Deduplicate: Run bismark_deduplicate (Bismark pipeline) or Picard tools (bwameth pipeline) to remove PCR duplicates. (default: off, but should be used in most cases)

Minimum Coverage: Minimum read depth required for a position's methylation percentage to be reported and visualized. (default: 5, which should be a sane number for most applications, but it's up to the user)

Metaplot: Generates a diagram of average methylation across all features of interest.

  • outer distance: the distance outside the start and end of each feature
  • inner distance: the distance inside the start and end of each feature
  • window size: the size of the sliding window for computing the average

The metaplot is added as metadata to the experiment.
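
The role of the window size can be illustrated with a small windowed average over methylation values pooled across features. This is only a sketch of the general idea and not the metaplot program itself: the function name, the representation of the data as (offset, methylation) pairs relative to a feature boundary, and the step parameter are all assumptions.

```python
def windowed_means(points, start, end, window, step):
    """Average methylation in windows sliding from `start` to `end`.

    points: list of (offset, methylation) pairs, where offset is the
            distance in bp from a feature boundary (negative = outside)
            and methylation is a fraction between 0 and 1.
    Returns a list of (window_start, mean) tuples; mean is None for a
    window that contains no data.
    """
    out = []
    pos = start
    while pos + window <= end:
        vals = [m for off, m in points if pos <= off < pos + window]
        mean = sum(vals) / len(vals) if vals else None
        out.append((pos, mean))
        pos += step
    return out
```

In this picture the outer and inner distances would set how far `start` and `end` extend outside and inside the feature, and a larger window gives a smoother curve at the cost of resolution.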

Output Files

In addition to direct visualization, files are downloadable to enable downstream analyses (e.g., detection of differentially methylated regions). Individual .csv files for each sequence context are downloadable from ExperimentView. These files comply with CoGe's LoadExperiment format for quantitative data and contain the following information:

 #CHR,POSITION,POSITION,STRAND,METHYLATION(0-1),DEPTH

Chromosome IDs match those of the genome to which the reads were aligned. Both POSITION columns give the position of the cytosine, STRAND is 1 for the top strand and -1 for the bottom strand, methylation is expressed as a decimal between 0 and 1, and DEPTH is the integer read depth at each methylation call (already filtered by the minimum coverage specified for the analysis).
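
For downstream analysis, these files can be read with any CSV parser. The sketch below assumes the six-column layout shown above, with a header line beginning with "#"; the function names and the depth-weighted summary are illustrative and not part of CoGe.

```python
import csv
import io


def parse_methylation_csv(text):
    """Parse CoGe-style methylation CSV text into a list of dicts."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header/comment line
        chrom, start, _stop, strand, meth, depth = row
        records.append({
            "chrom": chrom,
            "position": int(start),   # start and stop are the same cytosine
            "strand": int(strand),    # 1 = top, -1 = bottom
            "methylation": float(meth),
            "depth": int(depth),
        })
    return records


def depth_weighted_mean(records):
    """Mean methylation weighted by read depth; None if there is no data."""
    total_depth = sum(r["depth"] for r in records)
    if total_depth == 0:
        return None
    return sum(r["methylation"] * r["depth"] for r in records) / total_depth
```

Weighting by depth is one possible design choice; it gives well-covered cytosines more influence than an unweighted mean would.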

The BAM alignment can be downloaded in ExperimentView. Sequence and quantitative information for the visible region can also be downloaded while browsing the genome.

Example Data