PolyMFind

From CoGepedia

Overview

PolyMFind is a program that scans through a multiple genome alignment to identify and classify all polymorphisms. The multiple sequence alignments are generated by Mauve. Since the alignment is generated using sequences stored in CoGe, PolyMFind can query the location of each polymorphism and identify any underlying genomic feature. If the genomic feature is a protein coding gene, the change in coding sequence is assessed to determine if it causes a frameshift, synonymous, or nonsynonymous mutation.
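The coding-sequence assessment described above can be illustrated with a minimal sketch. This is not PolyMFind's actual code: the function names `classify_snp` and `is_frameshift` are hypothetical, the codons are assumed to be already oriented to the gene's strand, and the standard genetic code (NCBI translation table 1) is assumed.

```python
# Standard genetic code (NCBI translation table 1), built from the TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(ref_codon, alt_codon):
    """Hypothetical helper: label a SNP by comparing the encoded amino acids.

    Both codons must already be read on the strand of the gene.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"


def is_frameshift(length_change):
    """Hypothetical helper: an insertion or deletion inside a coding sequence
    shifts the reading frame when its length is not a multiple of 3."""
    return length_change % 3 != 0


print(classify_snp("CTT", "CTC"))  # both codons encode leucine
print(classify_snp("GAA", "GTA"))  # glutamate versus valine
print(is_frameshift(1), is_frameshift(3))
```

For a gene on the reverse strand, the codons would first need to be reverse-complemented, a step this sketch leaves out.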

Note: Errors in these polymorphism calls arise from three major sources:

  • Sequencing errors
  • Assembly errors
  • Multiple sequence alignment errors

PolyMFind not only identifies and classifies polymorphisms, but also provides:

  • A false positive score for each polymorphism to help prioritize which ones to investigate further
  • Views of data to assess whether a polymorphism is real
  • Tools to compare genomic regions to assess whether a polymorphism is real
  • Information about genes with polymorphisms
  • Links to extract additional information for genes
  • Summary tables of:
    • All polymorphisms detected
    • Polymorphisms detected in each genome
    • Number of polymorphisms that may be the result of a homopolymer sequencing error
    • Counts of all the false positive scores

List of information in polymorphism table by column header

Note: all columns may be sorted by clicking on the header for that column.

  • Position: The position of the polymorphism as referenced by the first genome. This is approximate in the other genomes due to deletions, insertions, and missing sequence.
  • Position Sort: The "position" without commas so the column may be sorted numerically instead of alphanumerically
  • Gene: The name(s) of the gene (if one is present) affected by the polymorphism. This contains a link to CoGe's FeatView program for getting more information about the gene.
  • False positive score: The combined false positive score of the polymorphism. This is the sum of two factors: whether the polymorphism is in a homopolymer and how many polymorphisms the gene has. Both of these metrics are shown at the end of the table and described in more detail below.
  • Type: The type of polymorphism:
    • SNP: Single Nucleotide Polymorphism
    • indel: an insertion or deletion that occurs in more than one strain
    • deletion: a single strain missing some sequence
    • insertion: a single strain containing additional sequence
    • contig join: a stretch of the ambiguous nucleotide "N" that is used by the syntenic path assembly algorithm to stitch together neighboring contigs when generating a full length genome.
  • Subtype:
    • Frameshift: an insertion/deletion/indel in a protein coding sequence that causes a change in the reading frame for translation
    • Large deletion/insertion: an insertion or deletion that is larger than 100 nt
    • Nonsynonymous: a SNP in a protein coding sequence that results in the change of the encoded amino acid
    • Synonymous: a SNP in a protein coding sequence that does not result in a change of the encoded amino acid
  • Size: the size, in nucleotides, of the polymorphism
  • Size Sort: The "size" without commas so the column may be sorted numerically instead of alphanumerically
  • Polymorphism Summary: a single nucleotide view of the polymorphism across all genomes
  • Strain specific polymorphism (Note: spans several columns): The next set of columns shows the polymorphism for each specific strain at the identified position. Each entry in these columns is a link to CoGe's SeqView program for extracting the sequence around the polymorphism from the genomic sequence.
  • 10 nt align: This column has a link called "show" which will show the alignment of the polymorphism across all strains including 10 nucleotides upstream and downstream from the polymorphism.
Showing the 10 nt alignment around a polymorphism. This is likely a homopolymer sequencing error.

If the polymorphism is a "contig join", then 100 nt are shown upstream and downstream from the polymorphism.
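The flanking rule for the alignment view (10 nt on each side, or 100 nt for a contig join) can be sketched as below. This is an illustration under stated assumptions, not PolyMFind's implementation: the function name `alignment_window`, the dictionary input, and the choice to count the flank in alignment columns are all hypothetical.

```python
def alignment_window(aligned, start, end, poly_type="SNP"):
    """Return each strain's slice of the alignment around a polymorphism.

    aligned:   dict mapping strain name -> gapped sequence (all the same length)
    start/end: 0-based, end-exclusive alignment columns of the polymorphism
    poly_type: polymorphism type; "contig join" widens the flank to 100
    """
    flank = 100 if poly_type == "contig join" else 10
    left = max(0, start - flank)  # do not run off the start of the alignment
    right = end + flank           # slicing past the end is safe in Python
    return {name: seq[left:right] for name, seq in aligned.items()}


# Toy alignment (made-up sequences): strain2 lacks one A in a run of six.
strains = {
    "strain1": "G" * 12 + "AAAAAA" + "C" * 12,
    "strain2": "G" * 12 + "AAAAA-" + "C" * 12,
}
for name, window in alignment_window(strains, 17, 18, "deletion").items():
    print(name, window)
```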

  • Annotation: The annotation for the gene, if present, affected by the polymorphism
  • (Strand) Codon: If the polymorphism is in a protein coding gene, shows the alignment of the codons across the sequences, oriented to the strand on which the gene resides
  • GEvo link: Two links to CoGe's GEvo. Each GEvo analysis permits the comparison of the genomic regions around the polymorphism
    • GEvo link: searches 2x the length of the polymorphism or 100 nt upstream and downstream of the polymorphism (whichever value is greater)
    • GEvo link 20K: searches 20,000 nucleotides upstream and downstream of the polymorphism
  • CoGe Align Link: Performs a protein/codon alignment of the protein coding genes affected by the polymorphism
  • Homopolymer False Positive Score: Accounts for homopolymer sequencing errors. If the polymorphism lies in a homopolymer that is 5 or more nucleotides long, the length of the homopolymer is added to the false positive score, so a polymorphism in a homopolymer of length 10 gets a false positive score of 10. If the homopolymer is length 3, no score is added because the homopolymer is less than 5 nt in length. This cutoff is based on my analysis of the sequencing of 8 E. coli.
  • Multi-Hit False Positive Score: Accounts for assembly errors. Many hits in the same gene are likely due to a mis-assembly, either by the de novo assembly or by the syntenic path assembly algorithm. The former is often caused by tandem repeat sequences; the latter by the misplacement or mis-orientation of a contig.
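The homopolymer scoring rule described in the list above can be sketched as follows. This is a hedged illustration rather than PolyMFind's code: the function names are hypothetical, and measuring the run on a single ungapped sequence at a 0-based position is an assumption. The page does not give a formula for the multi-hit score, so it is taken here as an input.

```python
MIN_HOMOPOLYMER = 5  # cutoff stated in the text: runs shorter than 5 nt add nothing


def homopolymer_run_length(seq, pos):
    """Length of the run of identical bases that contains seq[pos] (0-based)."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def homopolymer_fp_score(seq, pos):
    """Score equals the run length when the run is at least 5 nt, else 0."""
    run = homopolymer_run_length(seq, pos)
    return run if run >= MIN_HOMOPOLYMER else 0


def combined_fp_score(homopolymer_score, multi_hit_score):
    """The combined false positive score is the sum of the two factors."""
    return homopolymer_score + multi_hit_score


print(homopolymer_fp_score("GC" + "A" * 10 + "GC", 5))  # inside a run of 10 A's
print(homopolymer_fp_score("GCAAAGC", 3))               # run of 3, below the cutoff
```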

Overview of analysis of a polymorphism region using GEvo

GEvo analysis of a polymorphism identified by PolyMFind using 20,000 nt upstream and downstream.


GEvo analysis of a polymorphism identified by PolyMFind with a high false positive score due to many polymorphisms occurring in the same gene. This is due to the mis-orientation of a contig as well as many polymorphisms near the end of a contig (visualized as white gaps in the pink regions of sequence similarity). The latter is likely due to errors in the de novo assembly.
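The search extents used by the two GEvo links in the polymorphism table can be expressed as a small sketch. The function name `gevo_flank` and its `link` argument are hypothetical; only the numeric rules (the greater of twice the polymorphism length or 100 nt, and a fixed 20,000 nt) come from the column descriptions.

```python
def gevo_flank(poly_size, link="default"):
    """Nucleotides searched upstream and downstream of the polymorphism.

    poly_size: size of the polymorphism in nucleotides
    link:      "default" for the first GEvo link, "20K" for the wide view
    """
    if link == "20K":
        return 20_000
    # Default link: twice the polymorphism length or 100 nt, whichever is greater.
    return max(2 * poly_size, 100)


print(gevo_flank(1))           # a SNP falls back to the 100 nt minimum
print(gevo_flank(80))          # an 80 nt indel is searched 160 nt each way
print(gevo_flank(500, "20K"))  # the 20K link ignores the polymorphism size
```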