Syntenic path assembly

From CoGepedia

Overview

The Syntenic Path Assembly option can be turned on by selecting the "Display Options" tab and checking the box next to "Syntenic Path Assembly (SPA)".

Syntenic Path Assembly is an option in SynMap that performs a reference-guided assembly of contigs, using synteny to determine the order and orientation of the contigs. To use this option, select "Order contigs by best syntenic path" under the Display Options tab. When an assembly is generated, you may download the pseudo-assembled sequence, in which contigs are joined with 100 "N"s. "N" is the ambiguous nucleotide: although it may represent any nucleotide (A, T, C, or G), a run of 100 "N"s marks where contigs were "glued" together by this algorithm.
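The joining step described above can be illustrated with a minimal Python sketch. It is not SynMap's code: the function names `join_contigs` and `find_joins` are hypothetical, and the contigs are assumed to be already ordered and oriented. Only the 100-"N" gap comes from the text.

```python
import re

GAP = "N" * 100  # gap length stated in the text above


def join_contigs(contigs):
    """Concatenate already ordered and oriented contig sequences with N gaps."""
    return GAP.join(contigs)


def find_joins(pseudo, gap_len=100):
    """Return (start, end) spans of every run of at least gap_len N's."""
    return [m.span() for m in re.finditer("N{%d,}" % gap_len, pseudo)]


# Toy sequences, invented for illustration only.
pseudo = join_contigs(["ACGTACGT", "GGGCCC", "TTTAAA"])
print(find_joins(pseudo))  # [(8, 108), (114, 214)]
```

Note that a contig which itself contains a run of 100 or more "N"s would also be reported by `find_joins`, so the spans are candidate join sites rather than a guaranteed list.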

This algorithm also works quite well for aligning a WGS assembly between distantly related organisms (see the examples below). Note: Use caution when relying on a syntenic path assembly. Breakpoints in a genome assembly are often due to stretches of repetitive sequence, which can also serve as sites for genomic rearrangements such as inversions, duplications, and chromosome fusions and fissions, so an order inferred from synteny may hide a real rearrangement at exactly these points. If a pseudo-assembled genome is used in a syntenic path assembly of another genome, assembly errors will be propagated. If a pseudo-assembled genome is added back to CoGe, it is highly recommended that it be annotated as such.

Algorithm

As implemented in SynMap:


  1. Parse syntenic blocks from SynMap output file:
    1. Score (if not present, equal to the number of gene pairs in block)
    2. Orientation (forward or reverse)
    3. Names of chromosomes/contigs involved in block
    4. Start position:
      1. Remove the first and last gene pairs (these are often noisy) if there are 3 or more gene pairs.
      2. Calculate the mean start value of all the remaining gene-pairs in a syntenic block
        1. E.g.: 5 gene pairs with start values of 10, 1000, 1200, 1400, 5000. Start value = (1000+1200+1400)/3 = 1200
  2. Sum all synteny scores for each pair of chromosomes/contigs between the two genomes
    1. E.g. if Contig1 and Chr1 have two syntenic blocks, scores 5 and 6, their combined synteny score is 11. (New: May 2012)
  3. Determine which genome is the reference genome: the one with fewer chromosomes/contigs
    1. For the examples listed here, the reference genome is assumed to have chromosomes, and the genome to be assembled has contigs
  4. Determine assignment of a contig to a chromosome based on having the highest combined synteny score
    1. E.g. if Contig1 and Chr1 have a combined synteny score of 11, and Contig1 and Chr2 have a combined synteny score of 8, Contig1 will be assigned to Chr1
  5. Order contigs along a chromosome based on start position (calculation described above)
  6. Orient each contig based on the majority rule of the orientation of its syntenic blocks.
    1. Specifically, if half or more of the syntenic blocks are in reverse orientation, the contig is reverse complemented.


Thus, the tiling looks like:

|-----A-----|   |-----B-----|   |-----C-----| 
|-1-||-2-||-3-| |-4-||-5-||-6-| |-7-||-8-||-9-| etc.,
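The start-position, ordering, and orientation steps of the algorithm above can be sketched in Python as follows. This is an illustrative sketch, not SynMap's implementation. All function names and the input layout are hypothetical. Two assumptions are not stated in the text: start values are positions on the reference chromosome, and a contig with several syntenic blocks is placed at the smallest of its block start values.

```python
def block_start(starts):
    """Mean start of a block's gene pairs, dropping the first and last
    pair when the block has 3 or more pairs."""
    if len(starts) >= 3:
        starts = starts[1:-1]
    return sum(starts) / len(starts)


def reverse_complement(seq):
    """Reverse complement; characters such as N are left unchanged."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def should_flip(orientations):
    """True if half or more of the blocks are in reverse orientation."""
    reverse = sum(1 for o in orientations if o == "reverse")
    return 2 * reverse >= len(orientations)


def order_contigs(contig_blocks):
    """contig_blocks maps contig -> [(gene_pair_starts, orientation), ...]
    for the contigs assigned to ONE chromosome.
    Returns [(contig, flip)] sorted by position."""
    placed = []
    for contig, blocks in contig_blocks.items():
        # Assumption: use the smallest block start as the contig position.
        position = min(block_start(starts) for starts, _ in blocks)
        flip = should_flip([orient for _, orient in blocks])
        placed.append((position, contig, flip))
    placed.sort()
    return [(contig, flip) for _, contig, flip in placed]


# The worked example from the text: (1000 + 1200 + 1400) / 3
print(block_start([10, 1000, 1200, 1400, 5000]))  # 1200.0

# Invented data for illustration only.
example = {
    "Contig2": [([5000, 5200, 5400, 5600], "forward")],
    "Contig1": [([10, 1000, 1200, 1400, 5000], "forward"),
                ([2000, 2100, 2200], "reverse")],
}
print(order_contigs(example))  # [('Contig1', True), ('Contig2', False)]
```

In the example, Contig1 has one forward and one reverse block, so the "half or more" rule flips it. A contig marked for flipping would then be passed through `reverse_complement` before joining.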
  • Improvements (May 2012)
    • Selection of which contig maps to which chromosome has been updated to use
      • The maximum combined score of all syntenic blocks
      • For example: Contig1 has two syntenic blocks with Chr1 (scores 5 and 5) and one syntenic block with Chr2 (score 8). Contig1 is placed with Chr1 (combined score 10 versus 8).
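The combined-score assignment can be sketched in a few lines of Python. This is a hedged illustration rather than the SynMap code: the function name `assign_contigs` and the `(contig, chromosome, score)` tuple layout are hypothetical, and the tie-breaking behavior (the first chromosome seen wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import defaultdict


def assign_contigs(blocks):
    """blocks: iterable of (contig, chromosome, score) tuples.
    Returns {contig: chromosome with the highest combined score}."""
    combined = defaultdict(int)
    for contig, chrom, score in blocks:
        combined[(contig, chrom)] += score  # sum scores per contig/chromosome pair

    best = {}
    for (contig, chrom), total in combined.items():
        # Assumption: on a tie, the chromosome seen first is kept.
        if contig not in best or total > best[contig][1]:
            best[contig] = (chrom, total)
    return {contig: chrom for contig, (chrom, _) in best.items()}


# The example from the list above: 5 + 5 = 10 for Chr1 versus 8 for Chr2.
blocks = [("Contig1", "Chr1", 5), ("Contig1", "Chr1", 5), ("Contig1", "Chr2", 8)]
print(assign_contigs(blocks))  # {'Contig1': 'Chr1'}
```

A rule that looked only at the single highest-scoring block would place Contig1 with Chr2 here (8 beats 5), which is the case where summing scores changes the result.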


Old Version

  1. Identify and score each syntenic region
  2. Assign genome with fewer chromosomes/contigs to be reference
  3. For each chromosome in reference genome:
    1. Sort all syntenic blocks that match the reference genome according to position in the reference genome
    2. For ties in position, pick the syntenic block with the greater synteny score
    3. Flip the contig if the syntenic block is reversed

Thus, the tiling looks like:

|-----A-----|   |-----B-----|   |-----C-----| 
|-1-||-2-||-3-| |-4-||-5-||-6-| |-7-||-8-||-9-| etc.,
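For comparison, the old-version steps can be sketched as below. This is one possible reading, not the original code: the function name and the dictionary keys are hypothetical. It assumes that "pick" means only the higher-scoring block is kept at a tied position, and that a contig is placed once, at its first block in sorted order.

```python
def tile_chromosome(blocks):
    """blocks: list of dicts for ONE reference chromosome, each with keys
    'contig', 'ref_start', 'score', 'reversed'.
    Returns [(contig, flip)] in tiling order."""
    # Sort by reference position; at equal positions the higher score comes first.
    ordered = sorted(blocks, key=lambda b: (b["ref_start"], -b["score"]))

    tiling = []
    used_positions = set()
    used_contigs = set()
    for block in ordered:
        # Assumption: skip lower-scoring ties and contigs that are already placed.
        if block["ref_start"] in used_positions or block["contig"] in used_contigs:
            continue
        used_positions.add(block["ref_start"])
        used_contigs.add(block["contig"])
        tiling.append((block["contig"], block["reversed"]))
    return tiling


# Invented data for illustration only.
old_example = [
    {"contig": "c2", "ref_start": 500, "score": 4, "reversed": True},
    {"contig": "c1", "ref_start": 100, "score": 3, "reversed": False},
    {"contig": "c3", "ref_start": 100, "score": 7, "reversed": False},
]
print(tile_chromosome(old_example))  # [('c3', False), ('c2', True)]
```

Here c1 and c3 tie at position 100, so only c3 (score 7) is kept, and c2 is flipped because its block is reversed.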

Definitions


[Image: Screen shot 2011-02-18 at 1.58.23 PM.png]

E. coli


Tetraodon (puffer fish): Takifugu rubripes and Tetraodon nigroviridis

Carnivora: Giant Panda (WGS Assembly) to Dog (reference genome)

Arabidopsis ecotypes: Columbia versus Landsberg erecta


Phoenix dactylifera L. (date palm) v. Oryza sativa japonica (Rice)

Cannabis sativa (marijuana) v. Prunus persica (peach)