PolyMFind

From CoGepedia
Revision as of 23:36, 23 June 2011 by Elyons (talk | contribs) (Overview)

Overview

PolyMFind is a program that scans through a multiple genome alignment to identify and classify all polymorphisms. The multiple sequence alignments are generated by Mauve. Since the alignment is generated using sequences stored in CoGe, PolyMFind can query the location of each polymorphism and identify any underlying genomic feature. If the genomic feature is a protein coding gene, the change in coding sequence is assessed to determine if it causes a frameshift, synonymous, or nonsynonymous mutation.
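The coding-sequence assessment described above can be illustrated with a minimal sketch. This is not PolyMFind's own code: the function names and the "in-frame" label are hypothetical, and the sketch assumes the standard genetic code and that the affected codon has already been extracted in the gene's reading frame.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(ref_codon, alt_codon):
    """Compare the amino acids encoded by two codons that differ by a SNP.

    Hypothetical helper: both codons must be in-frame, 3 nt long,
    and contain only A, C, G, or T.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"


def classify_length_change(size_nt):
    """An insertion or deletion inside a coding sequence shifts the reading
    frame unless its length is a multiple of 3.

    The "in-frame" label is an assumption of this sketch, not a PolyMFind subtype.
    """
    return "frameshift" if size_nt % 3 != 0 else "in-frame"


if __name__ == "__main__":
    print(classify_snp("CTT", "CTC"))   # both codons encode leucine
    print(classify_snp("GAA", "GTA"))   # glutamate versus valine
    print(classify_length_change(1))    # a 1 nt indel shifts the frame
```

A real implementation would also need to handle genes on the reverse strand and codons that span exon boundaries, which this sketch ignores.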

Note: Errors in these polymorphisms arise from three major sources:

  • Sequencing errors
  • Assembly errors
  • Multiple sequence alignment errors

PolyMFind not only identifies and classifies polymorphisms, but also provides:

  • A false positive score of the polymorphism to help prioritize ones to investigate further
  • Views of data to assess whether a polymorphism is real
  • Tools to compare genomic regions to assess whether a polymorphism is real
  • Information about genes with polymorphisms
  • Links to extract additional information for genes
  • Summary tables of:
    • All polymorphisms detected
    • Polymorphisms detected in each genome
    • Number of polymorphisms that may be the result of a homopolymer sequencing error
    • Counts of all the false positive scores

List of information in polymorphism table by column header

Note: all columns may be sorted by clicking on the header for that column.

  • Position: The position of the polymorphism as referenced by the first genome. This is approximate in the other genomes due to deletions, insertions, and missing sequence.
  • Position Sort: The "position" without commas so the column may be sorted numerically instead of alphanumerically
  • Gene: The name(s) of the gene (if one is present) affected by the polymorphism. This contains a link to CoGe's FeatView program for getting more information about the gene.
  • False positive score: The combined false positive score of the polymorphism. This is the sum of two factors: whether the polymorphism is in a homopolymer and how many polymorphisms the gene has. Both of these metrics are shown at the end of the table and are described in more detail below.
  • Type: The type of polymorphism:
    • SNP: Single Nucleotide Polymorphism
    • indel: an insertion or deletion that occurs in more than one strain
    • deletion: a single strain missing some sequence
    • insertion: a single strain containing additional sequence
    • contig join: a stretch of the ambiguous nucleotide "N" that is used by the syntenic path assembly algorithm to stitch together neighboring contigs when generating a full-length genome.
  • Subtype:
    • Frameshift: an insertion, deletion, or indel in a protein coding sequence that causes a change in the reading frame for translation
    • Large deletion/insertion: an insertion or deletion that is larger than 100 nt
    • Nonsynonymous: a SNP in a protein coding sequence that results in the change of the encoded amino acid
    • Synonymous: a SNP in a protein coding sequence that does not result in a change of the encoded amino acid
  • Size: the size, in nucleotides, of the polymorphism
  • Size Sort: The "size" without commas so the column may be sorted numerically instead of alphanumerically
  • Polymorphism Summary: a single nucleotide view of the polymorphism across all genomes
  • Strain specific polymorphism (Note: spans several columns): The next set of columns shows the polymorphism for each specific strain at the identified position. Each entry in these columns is a link to CoGe's SeqView program for extracting the sequence around the polymorphism from the genomic sequence.
  • 10 nt align: This column has a link called "show" which will show the alignment of the polymorphism across all strains including 10 nucleotides upstream and downstream from the polymorphism.
Showing the 10 nt alignment around a polymorphism. This is likely a homopolymer sequencing error.
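As a rough illustration of the 10 nt alignment view and the homopolymer check that feeds the false positive score, the sketch below slices a window out of a set of aligned sequences and tests whether a position lies in a run of identical bases. The function names, the dictionary layout, and the `min_run` threshold of 4 are all assumptions made for this example; the source does not state how PolyMFind defines a homopolymer or how it weights the score.

```python
def alignment_window(aligned, start, end, flank=10):
    """Return each aligned row restricted to the polymorphism plus flanks.

    aligned: dict mapping strain name -> aligned sequence (equal lengths).
    start, end: 0-based, end-exclusive alignment columns of the polymorphism.
    The window is clipped at the edges of the alignment.
    """
    lo = max(0, start - flank)
    return {name: seq[lo:end + flank] for name, seq in aligned.items()}


def in_homopolymer(seq, pos, min_run=4):
    """True if seq[pos] sits in a run of at least min_run identical bases.

    min_run is a hypothetical threshold. Gap ("-") and "N" characters
    are never counted as homopolymers.
    """
    base = seq[pos]
    if base in "-N":
        return False
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right + 1 < len(seq) and seq[right + 1] == base:
        right += 1
    return right - left + 1 >= min_run


if __name__ == "__main__":
    rows = {
        "strain_A": "GATTACAGGCTTTTTTACGGATCCA",
        "strain_B": "GATTACAGGCTTTTT-ACGGATCCA",
    }
    # The polymorphism is the single column where strain_B has a gap.
    for name, seq in alignment_window(rows, 15, 16).items():
        print(name, seq)
    print(in_homopolymer(rows["strain_A"], 15))
```

A length difference at the end of a run of T's, as in this made-up example, is the kind of pattern the caption above describes as a likely homopolymer sequencing error.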