Ancestral Reconstruction Pipeline: Difference between revisions

This page documents the Ancestral Reconstruction Pipeline by Chunfang Zheng.
[[Image:20140424 165903.jpg|thumb|right|200px|Pipeline]] [[Image:Photo.JPG|thumb|right|200px|Refactored pipeline]]


Master control is from her batch script: batchFile.txt
*[[Images from hackathon]]
*[[Notes on pipeline details]]


<br>
=== Plan for refactoring ===

==== GetGenomes: remove config file, add option to specify output dir for output files ====
This program gets data ready for downstream processing
* -d <directory of input synmap files>
* -g gid1,gid2,gid3,gid4... &lt;list of comma separated coge genome ids&gt;
* -p p1,p2,p3,p4 &lt;list of comma separated ploidy levels for genomes -- note these are paired ordered data with the -g option?&gt;
* -s <subgenome_file>
* -o &lt;output directory for SubGenomeInGeneOrder OrthologSets GenomeInString files&gt;
====GetContigInput====
This program gets data ready for ancestral ordering of genes by MWM
*Remove config file dependency
* -g gid1,gid2,gid3,gid4... &lt;list of comma separated coge genome ids&gt;
* -w w1,w2,w3,w4 &lt;list of comma separated weights for genomes -- note these are paired ordered data with the -g option?&gt;
* -wa <threshold minimum adjacency score for keeping a contig.  Called 'weightOfAdjacent' in original config file>
* -i <input file: GenomesInString from program GetGenomes>
* -o <output file name>
Note:  Output is now a tab delimited file with each line containing:  vertex vertex weight.  These data will be used by the MWMPython program below

====MWMPython: http://jorisvr.nl/maximummatching.html needs command line options====
This program is a general tool for Maximum Weight Matching.  First run is for ancestral gene ordering.  Second run is for ancestral contig ordering
* -i &lt;input file or directory&gt;
** File type is a set of vertex vertex weight
** note: if directory, will batch process all files
* -o &lt;outfile or directory&gt;  If no option is specified, the results go to STDOUT

====GetContigOutputAndScaffoldInput====
This program maps ancestral contigs back to the various genomes, gets their positions, and formats the data for a second MWM run that generates ancestrally ordered contigs

Note:  the goal is to get everything onto the command line. Currently, several of the files are hardcoded
*-mml <threshold minimum mapping length. The minimum number of "genes" mapping to a subgenome. Called 'minimumGeneGroupLength' in original config file>
'''Note''':  genomeInContigIndex is specified in the config file. This is the weighting for each subgenome. These values will be removed and the weights for genomes used instead:
* -g gid1,gid2,gid3,gid4... &lt;list of comma separated coge genome ids&gt;
* -w w1,w2,w3,w4 &lt;list of comma separated weights for genomes -- note these are paired ordered data with the -g option>
*-co <contig output file generated by the MWMPython program>
*-s <subGenomesInGeneOrder file>
*-gf <genomeInString file>
* -o <output directory name>
'''Note:''' Multiple output files, one for each ancestral chromosome bin

'''Note:'''  There are three sets of data that are needed for the subsequent steps:
* Directory named "scaffolds".  These are tab delimited files with each line containing:  vertex vertex weight.  These data will be used by the MWMPython program
* Directory named "binfiles".  These files are used in the final program ScaffoldOutput to assign the correct contigs to the bins
* File named "contig2genes.txt".  This file is used to convert the contigs back to genes.  These are 'syntenic gene sets'.

====MWMPython====
Reuse of the same program listed above for ancestral ordering of contigs

====ScaffoldOutput====
Putting all the output data back together for the final ancestral genome
*-im <directory of MWM files>
*-ib <directory containing the bin files>
** This directory is called 'binfiles' when generated by GetContigOutputAndScaffoldInput
*-cg <file containing the lookup of gene families comprising contigs>
**This file is named 'contig2genes.txt' when generated by GetContigOutputAndScaffoldInput
*-o <output file of reconstructed genome>

Batch script (batchFile.txt):
<pre>#compile
#gets gene pairs from SynMap output
javac TestGetGenomes.java

#run with config file
#config file: number of genomes and syntenic depth relationships
java TestGetGenomes data/inputInfoCoGe.txt

#outputs from above

javac TestGetContigInput.java
java TestGetContigInput data/inputInfoAGRP.txt
cd outputFiles
python contigInput_8400_9050_10997_19515.py &gt; contigOutput.txt
cd ..
javac TestGetContigOutputAndScaffoldInput.java
java TestGetContigOutputAndScaffoldInput data/inputInfoAGRP.txt
cd outputFiles
python scaffoldInput1.py &gt; scaffoldOutput1.txt
python scaffoldInput2.py &gt; scaffoldOutput2.txt
python scaffoldInput3.py &gt; scaffoldOutput3.txt
python scaffoldInput4.py &gt; scaffoldOutput4.txt
python scaffoldInput5.py &gt; scaffoldOutput5.txt
python scaffoldInput6.py &gt; scaffoldOutput6.txt
python scaffoldInput7.py &gt; scaffoldOutput7.txt
cd ..
javac TestScaffoldOutput.java
java TestScaffoldOutput
</pre>

inputInfo example file (describes input from CoGe)

<pre>
#obvious
numberOfGenomes 4
numberOfGenomePairs 9

#synmap output with correct syntenic depth
8400 9050 data/8400_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
10997 8400 data/10997_8400.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
10997 9050 data/10997_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
10997 19515 data/10997_19515.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.1.40.gcoords
19515 8400 data/19515_8400.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac1.3.40.gcoords
19515 9050 data/19515_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac1.3.40.gcoords
8400 8400 data/8400_8400.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
9050 9050 data/9050_9050.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
10997 10997 data/10997_10997.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords

#syntenic depth among the genomes
#peach
8400 3
#grape
9050 3
#cacao
10997 3
#amborella
19515 1

#subgenome information
data/subGenomeRegions.txt
</pre>
 
===SubGenomeRegions.txt===
This file contains information about the subgenomes (parental genomes) making up an extant genome. Chunfang often creates these files by hand, but also has a program to generate them: Practical_Aliquoting.
<pre>
#genome_ID    number_of_synteny_blocks  paleopolyploid_depth title_for_set
10997 21 3 cacao
#colorCode: means ancestral chromosome assignment -- better term is bin.  For eudicots, this is thought to be 7 (could be other numbers for other reconstructions)
#subgenome: the subgenome to which a block belongs
#chr start end:  position of block in extant genome
colorCode subgenome chr start end
1 1 2 12716774 27462648
1 2 4 349021 14314443
1 3 3 208385 16091087
2 1 3 19982135 24212437
2 2 1 27207631 30674661
2 3 3 16741484 19970692
3 1 2 1350572 7237080
3 2 1 315357 7988483
3 3 8 43353 6712481
4 1 9 739576 3437504
4 2 6 1071819 9467758
4 3 9 3437504 9589803
5 1 4 18966341 23343107
5 2 1 21083375 26683534
5 3 5 23329957 25395907
6 1 9 23851693 28019603
6 2 5 541440 5362779
6 3 10 333882 12953021
7 1 6 10864133 14795052
7 2 1 8722499 15224371
7 3 7 511932 6542889
19515 0 1 amborella
colorCode subgenome chr start end
</pre>
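
The layout above can be read with a short parser. The following is a minimal sketch, not part of the pipeline: it assumes that a genome record always has four whitespace-separated fields, that a block row always has five, and that the title contains no spaces. The names `Block`, `GenomeRegions` and `parse_subgenome_regions` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    color_code: int  # ancestral chromosome assignment ("bin")
    subgenome: int   # subgenome to which the block belongs
    chrom: str       # chromosome in the extant genome
    start: int
    end: int


@dataclass
class GenomeRegions:
    genome_id: str
    n_blocks: int    # number_of_synteny_blocks
    depth: int       # paleopolyploid_depth
    title: str
    blocks: list = field(default_factory=list)


def parse_subgenome_regions(text):
    """Parse SubGenomeRegions-style text into a list of GenomeRegions.

    Assumption: 4 fields = genome record, 5 fields = block row,
    '#' lines and the 'colorCode ...' column header are skipped.
    """
    genomes = []
    for number, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if parts[0] == "colorCode":
            continue
        if len(parts) == 4:
            gid, n_blocks, depth, title = parts
            genomes.append(GenomeRegions(gid, int(n_blocks), int(depth), title))
        elif len(parts) == 5 and genomes:
            color, sub, chrom, start, end = parts
            genomes[-1].blocks.append(
                Block(int(color), int(sub), chrom, int(start), int(end))
            )
        else:
            raise ValueError(f"line {number}: unexpected record: {line!r}")
    return genomes
```

Comparing `len(genome.blocks)` with `genome.n_blocks` after parsing is a cheap consistency check for hand-made files.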

Latest revision as of 01:46, 26 April 2014

Pipeline
Refactored pipeline

Plan for refactoring

GetGenomes: remove config file, add option to specify output dir for output files

This program gets data ready for downstream processing

  • -d <directory of input synmap files>
  • -g gid1,gid2,gid3,gid4... <list of comma separated coge genome ids>
  • -p p1,p2,p3,p4 <list of comma separated ploidy levels for genomes -- note these are paired ordered data with the -g option?>
  • -s <subgenome_file>
  • -o <output directory for SubGenomeInGeneOrder OrthologSets GenomeInString files>
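
As an illustration of how the paired -g and -p lists could be handled, here is a minimal Python sketch of the proposed command line. It is not the Java implementation: the function names, the `dest` names and the default output directory are assumptions, and the ploidy values in the usage note are only illustrative.

```python
import argparse


def csv_list(value):
    """Split a comma separated option value into a list of strings."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="GetGenomes")
    parser.add_argument("-d", dest="synmap_dir", required=True,
                        help="directory of input synmap files")
    parser.add_argument("-g", dest="genome_ids", type=csv_list, required=True,
                        help="comma separated coge genome ids")
    parser.add_argument("-p", dest="ploidy", type=csv_list, required=True,
                        help="comma separated ploidy levels, same order as -g")
    parser.add_argument("-s", dest="subgenome_file", required=True,
                        help="subgenome file")
    parser.add_argument("-o", dest="out_dir", default=".",
                        help="output directory (default is an assumption)")
    return parser


def parse_args(argv):
    args = build_parser().parse_args(argv)
    # -g and -p are paired by position, so the lists must have equal length.
    if len(args.genome_ids) != len(args.ploidy):
        raise ValueError("-g and -p must list the same number of genomes")
    args.ploidy_by_genome = dict(
        zip(args.genome_ids, (int(p) for p in args.ploidy))
    )
    return args
```

For example, `parse_args(["-d", "data", "-g", "8400,9050", "-p", "3,3", "-s", "data/subGenomeRegions.txt"])` yields `ploidy_by_genome == {"8400": 3, "9050": 3}`. Checking the list lengths up front turns a silent mispairing into an immediate error.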

GetContigInput

This program gets data ready for ancestral ordering of genes by MWM

  • Remove config file dependency
  • -g gid1,gid2,gid3,gid4... <list of comma separated coge genome ids>
  • -w w1,w2,w3,w4 <list of comma separated weights for genomes -- note these are paired ordered data with the -g option?>
  • -wa <threshold minimum adjacency score for keeping a contig. Called 'weightOfAdjacent' in original config file>
  • -i <input file: GenomesInString from program GetGenomes>
  • -o <output file name>

Note: Output is now a tab delimited file with each line containing: vertex vertex weight. These data will be used by the MWMPython program below
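
The "vertex vertex weight" format can be written and read back in a few lines. This is a sketch under assumptions that the page does not state: vertices are taken to be integers and weights to be numeric, and the function names are hypothetical.

```python
import csv
import io


def write_edges(edges, handle):
    """Write (vertex, vertex, weight) tuples as tab delimited lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for u, v, weight in edges:
        writer.writerow([u, v, weight])


def read_edges(handle):
    """Read tab delimited 'vertex vertex weight' lines into a list of tuples.

    Assumption: integer vertex ids and a numeric weight on every line.
    """
    edges = []
    for number, raw in enumerate(handle, 1):
        line = raw.strip()
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 tab separated fields")
        u, v, weight = fields
        edges.append((int(u), int(v), float(weight)))
    return edges


if __name__ == "__main__":
    buffer = io.StringIO()
    write_edges([(0, 1, 5), (1, 2, 3)], buffer)
    print(read_edges(io.StringIO(buffer.getvalue())))
```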

MWMPython: http://jorisvr.nl/maximummatching.html needs command line options

This program is a general tool for Maximum Weight Matching. First run is for ancestral gene ordering. Second run is for ancestral contig ordering

  • -i <input file or directory>
    • File type is a set of vertex vertex weight
    • note: if directory, will batch process all files
  • -o <outfile or directory> If no option is specified, the results go to STDOUT
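
The file-or-directory behaviour of -i and -o could look like the sketch below. The matcher is passed in as a function; the exhaustive `brute_force_mwm` shown here is only a stand-in that is exact for very small graphs, not the algorithm used by the MWM program linked above. The output layout (one matched edge per line) and all function names are assumptions.

```python
import itertools
import os
import sys


def brute_force_mwm(edges):
    """Exact maximum weight matching by exhaustive search (tiny graphs only)."""
    best, best_weight = [], 0
    for size in range(1, len(edges) + 1):
        for subset in itertools.combinations(edges, size):
            used = [x for u, v, _ in subset for x in (u, v)]
            if len(used) != len(set(used)):
                continue  # two chosen edges share a vertex: not a matching
            weight = sum(w for _, _, w in subset)
            if weight > best_weight:
                best, best_weight = list(subset), weight
    return best


def load_edges(path):
    """Read 'vertex<TAB>vertex<TAB>weight' lines (integer vertices assumed)."""
    edges = []
    with open(path) as handle:
        for line in handle:
            if line.strip():
                u, v, w = line.strip().split("\t")
                edges.append((int(u), int(v), float(w)))
    return edges


def run(in_path, out_path=None, matcher=brute_force_mwm):
    """Process one file, or every file in a directory; default output is STDOUT."""
    batch = os.path.isdir(in_path)
    if batch:
        inputs = sorted(
            os.path.join(in_path, name) for name in os.listdir(in_path)
            if os.path.isfile(os.path.join(in_path, name))
        )
    else:
        inputs = [in_path]
    for path in inputs:
        matched = matcher(load_edges(path))
        text = "".join(f"{u}\t{v}\t{w:g}\n" for u, v, w in matched)
        if out_path is None:
            sys.stdout.write(text)
        elif batch:
            os.makedirs(out_path, exist_ok=True)
            target = os.path.join(out_path, os.path.basename(path))
            with open(target, "w") as out:
                out.write(text)
        else:
            with open(out_path, "w") as out:
                out.write(text)
```

Keeping the matcher as a parameter means the same wrapper can serve both the gene ordering run and the contig ordering run.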

GetContigOutputAndScaffoldInput

This program maps ancestral contigs back to the various genomes, gets their positions, and formats the data for a second MWM run that generates ancestrally ordered contigs.

Note: the goal is to get everything onto the command line. Currently, several of the files are hardcoded.

  • -mml <threshold minimum mapping length. The minimum number of "genes" mapping to a subgenome. Called 'minimumGeneGroupLength' in original config file>

Note: genomeInContigIndex is specified in the config file. This is the weighting for each subgenome. These values will be removed and the weights for genomes used instead:

  • -g gid1,gid2,gid3,gid4... <list of comma separated coge genome ids>
  • -w w1,w2,w3,w4 <list of comma separated weights for genomes -- note these are paired ordered data with the -g option>
  • -co <contig output file generated by the MWMPython program>
  • -s <subGenomesInGeneOrder file>
  • -gf <genomeInString file>
  • -o <output directory name>

Note: Multiple output files, one for each ancestral chromosome bin

Note: There are three sets of data that are needed for the subsequent steps:

  • Directory named "scaffolds". These are tab delimited files with each line containing: vertex vertex weight. These data will be used by the MWMPython program
  • Directory named "binfiles". These files are used in the final program ScaffoldOutput to assign the correct contigs to the bins
  • File named "contig2genes.txt". This file is used to convert the contigs back to genes. These are 'syntenic gene sets'.

MWMPython

Reuse of the same program listed above for ancestral ordering of contigs

ScaffoldOutput

Putting all the output data back together for the final ancestral genome

  • -im <directory of MWM files>
  • -ib <directory containing the bin files>
    • This directory is called 'binfiles' when generated by GetContigOutputAndScaffoldInput
  • -cg <file containing the lookup of gene families comprising contigs>
    • This file is named 'contig2genes.txt' when generated by GetContigOutputAndScaffoldInput
  • -o <output file of reconstructed genome>
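
The last step, turning ordered contigs back into genes, can be sketched as a table lookup. The layout of contig2genes.txt is not documented on this page, so the format used here (contig id, a tab, then comma separated gene ids) is purely an assumption for illustration, as are the function names.

```python
def load_contig2genes(lines):
    """Build {contig_id: [gene, ...]} from lines of text.

    Assumed (hypothetical) layout: contig_id<TAB>gene1,gene2,...
    """
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        contig_id, genes = line.split("\t", 1)
        table[contig_id] = [g for g in genes.split(",") if g]
    return table


def expand_contigs(ordered_contigs, contig2genes):
    """Replace each contig in an ordered list by its genes, keeping the order."""
    genes = []
    for contig_id in ordered_contigs:
        genes.extend(contig2genes[contig_id])
    return genes


if __name__ == "__main__":
    table = load_contig2genes(["c1\tg1,g2", "c2\tg3"])
    print(expand_contigs(["c2", "c1"], table))
```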