Ancestral Reconstruction Pipeline: Difference between revisions

Latest revision as of 01:46, 26 April 2014

Plan for refactoring

GetGenomes: remove config file, add option to specify output dir for output files

This program gets data ready for downstream processing

-d <directory of input synmap files>
-g gid1,gid2,gid3,gid4... <list of comma-separated CoGe genome ids>
-p p1,p2,p3,p4 <list of comma-separated ploidy levels for genomes -- note these are paired, ordered data with the -g option?>
-s <subgenome_file>
-o <output directory for SubGenomeInGeneOrder OrthologSets GenomeInString files>
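The option list above could be prototyped as follows. This is only a sketch in Python of how the paired -g/-p lists might be parsed and validated; the GetGenomes code itself is not shown on this page, and the names `comma_list`, `parse_args`, and every `dest` value are hypothetical. It also assumes that -g and -p really are paired by position, which the plan still marks with a question mark.

```python
import argparse


def comma_list(cast):
    """Return an argparse 'type' callable that splits 'a,b,c' and casts each item."""
    def parse(text):
        return [cast(item) for item in text.split(",") if item]
    return parse


def build_parser():
    # Option letters follow the plan above; dest names are illustrative only.
    parser = argparse.ArgumentParser(prog="GetGenomes")
    parser.add_argument("-d", dest="synmap_dir", required=True,
                        help="directory of input SynMap files")
    parser.add_argument("-g", dest="genome_ids", type=comma_list(str), required=True,
                        help="comma-separated CoGe genome ids")
    parser.add_argument("-p", dest="ploidy", type=comma_list(int), required=True,
                        help="comma-separated ploidy levels, same order as -g")
    parser.add_argument("-s", dest="subgenome_file", required=True,
                        help="subgenome file")
    parser.add_argument("-o", dest="out_dir", required=True,
                        help="output directory")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the two lists are paired by position, so lengths must agree.
    if len(args.genome_ids) != len(args.ploidy):
        parser.error("-g and -p must list the same number of entries")
    args.ploidy_by_genome = dict(zip(args.genome_ids, args.ploidy))
    return args
```

For example, `parse_args(["-d", "data", "-g", "8400,9050", "-p", "3,1", "-s", "sub.txt", "-o", "out"])` would pair genome 8400 with ploidy 3 and genome 9050 with ploidy 1 (values chosen purely for illustration).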

GetContigInput

This program gets data ready for ancestral ordering of genes by MWM

Remove config file dependency
-g gid1,gid2,gid3,gid4... <list of comma-separated CoGe genome ids>
-w w1,w2,w3,w4 <list of comma-separated weights for genomes -- note these are paired, ordered data with the -g option?>
-wa <threshold minimum adjacency score for keeping a contig. Called 'weightOfAdjacent' in original config file>
-i <input file: GenomesInString from program GetGenomes>
-o <output file name>

Note: Output is now a tab-delimited file with each line containing: vertex vertex weight. These data will be used by the MWMPython program below.
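As an illustration of the "vertex vertex weight" format, here is a minimal Python sketch that writes and reads such a file. The function names are hypothetical, and it assumes vertices are integers and weights are numeric; the page does not state the exact types.

```python
import csv
import io


def write_edges(edges, handle):
    """Write (vertex, vertex, weight) tuples as tab-delimited lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for u, v, w in edges:
        writer.writerow([u, v, w])


def read_edges(handle):
    """Read tab-delimited 'vertex vertex weight' lines into a list of tuples."""
    edges = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        u, v, w = row
        # Assumption: integer vertex ids and float weights.
        edges.append((int(u), int(v), float(w)))
    return edges


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_edges([(0, 1, 2.5), (1, 2, 4.0)], buffer)
edges = read_edges(io.StringIO(buffer.getvalue()))
```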

MWMPython: http://jorisvr.nl/maximummatching.html needs command line options for input and output

This program is a general tool for Maximum Weight Matching. First run is for ancestral gene ordering. Second run is for ancestral contig ordering

-i <input file or directory>
- File format: one "vertex vertex weight" entry per line
- note: if directory, will batch process all files
-o <outfile or directory> If no option is specified, the results go to STDOUT
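The -i/-o behaviour described above (single file or batch directory, STDOUT by default) might look roughly like the Python sketch below. It covers only the input/output handling. The `greedy_matching` function is a deliberately simple stand-in and does NOT compute a maximum weight matching; in practice the matching routine from the module linked above would be passed in instead. All function names and the output line format are assumptions.

```python
import argparse
import os
import sys


def read_edges(path):
    """Read 'vertex vertex weight' lines; assumes integer vertices."""
    edges = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            edges.append((int(fields[0]), int(fields[1]), float(fields[2])))
    return edges


def greedy_matching(edges):
    """Placeholder matcher: heaviest edge first. Not a true MWM algorithm."""
    used, pairs = set(), []
    for u, v, w in sorted(edges, key=lambda edge: -edge[2]):
        if u != v and u not in used and v not in used:
            used.update((u, v))
            pairs.append((u, v, w))
    return pairs


def run(in_path, out_path=None, matcher=greedy_matching):
    batch = os.path.isdir(in_path)
    if batch:
        names = sorted(os.listdir(in_path))
        files = [os.path.join(in_path, n) for n in names
                 if os.path.isfile(os.path.join(in_path, n))]
        if out_path:
            os.makedirs(out_path, exist_ok=True)
    else:
        files = [in_path]
    for path in files:
        pairs = matcher(read_edges(path))
        if out_path is None:
            out = sys.stdout  # no -o given: results go to STDOUT
        elif batch:
            out = open(os.path.join(out_path, os.path.basename(path)), "w")
        else:
            out = open(out_path, "w")
        try:
            for u, v, w in pairs:
                out.write(f"{u}\t{v}\t{w}\n")
        finally:
            if out is not sys.stdout:
                out.close()


def main(argv=None):
    parser = argparse.ArgumentParser(prog="MWMPython")
    parser.add_argument("-i", dest="in_path", required=True,
                        help="input file or directory")
    parser.add_argument("-o", dest="out_path", default=None,
                        help="output file or directory (default: STDOUT)")
    args = parser.parse_args(argv)
    run(args.in_path, args.out_path)
```

Passing the matcher as a parameter keeps the file handling independent of the matching algorithm, so the same wrapper could serve both the gene-ordering run and the contig-ordering run.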

GetContigOutputAndScaffoldInput

This program maps ancestral contigs back to various genomes, gets their positions, and gets data formatted for a second MWM to generate ancestral ordered contigs.

Note: goal is to get everything onto the command line. Currently, several of the files are hardcoded.

-mml <threshold minimum mapping length. The minimum number of "genes" mapping to a subgenome. Called 'minimumGeneGroupLength' in original config file>

Note: genomeInContigIndex is specified in the config file. This is the weighting for each subgenome. These values will be removed and the weights for genomes used instead:

-g gid1,gid2,gid3,gid4... <list of comma-separated CoGe genome ids>
-w w1,w2,w3,w4 <list of comma-separated weights for genomes -- note these are paired, ordered data with the -g option>
-co <contigOutput file generated by the MWMPython program>
-s <subGenomesInGeneOrder file>
-gf <genomeInString file>
-o <output directory name>

Note: Multiple output files, one for each ancestral chromosome bin

Note: There are three sets of data that are needed for the subsequent steps:

Directory named "scaffolds". It holds tab-delimited files with each line containing: vertex vertex weight. These data will be used by the MWMPython program.
Directory named "binfiles". These files are used in the final program ScaffoldOutput to assign the correct contigs to the bins.
File named "contig2genes.txt". This file is used to convert the contigs back to genes. These are 'syntenic gene sets'.
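A quick check that the three outputs listed above are present before moving on could be sketched in Python like this. The helper name `missing_outputs` is hypothetical, and the sketch assumes all three items sit directly under the -o output directory, which the page does not state explicitly.

```python
import os


def missing_outputs(out_dir):
    """Return the names of expected outputs that are absent from out_dir."""
    expected = {
        "scaffolds": os.path.isdir,          # input for the second MWMPython run
        "binfiles": os.path.isdir,           # used by ScaffoldOutput (-ib)
        "contig2genes.txt": os.path.isfile,  # used by ScaffoldOutput (-cg)
    }
    return [name for name, exists in expected.items()
            if not exists(os.path.join(out_dir, name))]
```

An empty list means the directory is ready for the second MWMPython run and for ScaffoldOutput.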

MWMPython

Reuse of the same program listed above for ancestral ordering of contigs

ScaffoldOutput

Putting all the output data back together for the final ancestral genome

-im <directory of MWM files>
-ib <directory of bin files>
- This directory is called 'binfiles' when generated by GetContigOutputAndScaffoldInput
-cg <lookup file of the gene families comprising each contig>
- This file is named 'contig2genes.txt' when generated by GetContigOutputAndScaffoldInput
-o <output file of reconstructed genome>

@@ Line 1: / Line 1: @@
-This page is to document the Ancestral Reconstruction Pipeline by Chunfang Zheng
+[[Image:20140424 165903.jpg|thumb|right|200px|Pipeline]] [[Image:Photo.JPG|thumb|right|200px|Refactored pipeline]]
-Master control is from her batch script: batchFile.txt
+*[[Images from hackathon]]
+*[[Notes on pipeline details]]
-<br>
+=== Plan for refactoring ===
-<pre>#compile
+==== GetGenomes: remove config file, add option to specify output dir for output files ====
-#gets gene pairs from SynMap output
+This program gets data ready for downstream processing
-javac TestGetGenomes.java
+* -d <directory of input synmap files>
+* -g gid1,gid2,gid3,gid4... &lt;list of comma-separated CoGe genome ids&gt;
+* -p p1,p2,p3,p4 &lt;list of comma-separated ploidy levels for genomes -- note these are paired, ordered data with the -g option?&gt;
+* -s <subgenome_file>
+* -o &lt;output directory for SubGenomeInGeneOrder OrthologSets GenomeInString files&gt;
+====GetContigInput====
+This program gets data ready for ancestral ordering of genes by MWM
+*Remove config file dependency
+* -g gid1,gid2,gid3,gid4... &lt;list of comma-separated CoGe genome ids&gt;
+* -w w1,w2,w3,w4 &lt;list of comma-separated weights for genomes -- note these are paired, ordered data with the -g option?&gt;
+* -wa <threshold minimum adjacency score for keeping a contig.  Called 'weightOfAdjacent' in original config file>
+* -i <input file: GenomesInString from program GetGenomes>
+* -o <output file name>
+Note:  Output is now a tab-delimited file with each line containing:  vertex vertex weight.  These data will be used by the MWMPython program below.
-#run with config file
+====MWMPython: http://jorisvr.nl/maximummatching.html needs command line options for input and output ====
-#config file: number of genomes and syntenic depth relationships
+This program is a general tool for Maximum Weight Matching.  First run is for ancestral gene ordering.  Second run is for ancestral contig ordering
+* -i &lt;input file or directory&gt;
+** File type is a set of vertex vertex weight
+** note: if directory, will batch process all files
+* -o &lt;outfile or directory&gt;  If no option is specified, the results go to STDOUT
-java TestGetGenomes data/inputInfoCoGe.txt
+====GetContigOutputAndScaffoldInput====
+This program maps ancestral contigs back to various genomes, gets their positions, and gets data formatted for a second MWM to generate ancestral ordered contigs
+Note:  goal is to get everything onto the command line.  Currently, several of the files are hardcoded
+*-mml <threshold minimum mapping length.  The minimum number of "genes" mapping to a subgenome. Called 'minimumGeneGroupLength' in original config file>
+'''Note''':  genomeInContigIndex is specified in the config file.  This is the weighting for each subgenome.  These values will be removed and the weights for genomes used instead:
+* -g gid1,gid2,gid3,gid4... &lt;list of comma-separated CoGe genome ids&gt;
+* -w w1,w2,w3,w4 &lt;list of comma-separated weights for genomes -- note these are paired, ordered data with the -g option>
+*-co <contigOutput file generated by the MWMPython program>
+*-s <subGenomesInGeneOrder file>
+*-gf <genomeInString file>
+* -o <output directory name>
+'''Note:''' Multiple output files, one for each ancestral chromosome bin
-#outputs from above:
+'''Note:'''  There are three sets of data that are needed for the subsequent steps:
-#               orthologSets_8400_9050_10997_19515.txt
+* Directory named "scaffolds".  It holds tab-delimited files with each line containing:  vertex vertex weight.  These data will be used by the MWMPython program.
-# This file contains orthologous sets of genes across genomes
+* Directory named "binfiles".  These files are used in the final program ScaffoldOutput to assign the correct contigs to the bins
+* File named "contig2genes.txt"  This file is used to convert the contigs back to genes.  These are 'syntenic gene sets'.
-<pre>
+====MWMPython====
-#ortholog_set org_1_paralog_1|org_1_paralog_2|org_1_paralog_3 org_2_paralog_1|etc. .
+Reuse of the same program listed above for ancestral ordering of contigs
-#max number of paralog genes in an organism is based on the syntenic depth for that organism.  E.g., three for peach
-	PAC:17653852|PAC:17650319|PAC:17656076	GSVIVG01035853001|GSVIVG01017316001|GSVIVG01015543001	Tc06_g001640|Tc09_g002300|Tc09_g011400	evm_27.model.AmTr_v1.0_scaffold00078.25
-	PAC:17659187|PAC:17667521|PAC:17645335	GSVIVG01024935001|GSVIVG01034155001|GSVIVG01016522001	Tc05_g004430|Tc09_g031500|Tc10_g002720	evm_27.model.AmTr_v1.0_scaffold00012.209
-	PAC:17660832|PAC:17641120|PAC:17649248	GSVIVG01035855001|GSVIVG01017319001|GSVIVG01015546001	Tc06_g001660|Tc09_g002290|Tc09_g011430	evm_27.model.AmTr_v1.0_scaffold00078.24
-....
-	PAC:17654291	missing	Tc02_g001110	missing
-</pre>
+====ScaffoldOutput====
-#               genomeInString_8400_9050_10997_19515.txt
+Putting all the output data back together for the final ancestral genome
-#               subgenomeRangesInGeneOrder_8400_9050_10997_19515.txt
+*-im <directory of MWM files>
+*-ib <directory of bin files>
-javac TestGetContigInput.java
+** This directory is called 'binfiles' when generated by GetContigOutputAndScaffoldInput
-java TestGetContigInput data/inputInfoAGRP.txt
+*-cg <lookup file of the gene families comprising each contig>
-cd outputFiles
+**This file is named 'contig2genes.txt' when generated by GetContigOutputAndScaffoldInput
-python contigInput_8400_9050_10997_19515.py&gt; contigOutput.txt
+*-o <output file of reconstructed genome>
-cd ..
-javac TestGetContigOutputAndScaffoldInput.java
-java TestGetContigOutputAndScaffoldInput data/inputInfoAGRP.txt
-cd outputFiles
-python scaffoldInput1.py &gt; scaffoldOutput1.txt
-python scaffoldInput2.py &gt; scaffoldOutput2.txt
-python scaffoldInput3.py &gt; scaffoldOutput3.txt
-python scaffoldInput4.py &gt; scaffoldOutput4.txt
-python scaffoldInput5.py &gt; scaffoldOutput5.txt
-python scaffoldInput6.py &gt; scaffoldOutput6.txt
-python scaffoldInput7.py &gt; scaffoldOutput7.txt
-cd ..
-javac TestScaffoldOutput.java
-java TestScaffoldOutput
-</pre>
-inputInfo example file (describes input from CoGe)
-<pre>
-#obvious
-numberOfGenomes	4
-numberOfGenomePairs	9
-#synmap output with correct syntenic depth
-	9050	data/8400_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	8400	data/10997_8400.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	9050	data/10997_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	19515	data/10997_19515.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.1.40.gcoords
-	8400	data/19515_8400.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac1.3.40.gcoords
-	9050	data/19515_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac1.3.40.gcoords
-	8400	data/8400_8400.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	9050	data/9050_9050.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	10997	data/10997_10997.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-#syntenic depth among the genomes
-#peach
-	3
-#grape
-	3
-Cacao
-	3
-#amborella
-	1
-#subgenome information
-data/subGenomeRegions.txt
-</pre>
-===SubGenomeRegions.txt===
-This file contains information about subgenomes (parental genomes) making up an extant genome.  Chunfang often creates these by hand, but does have a program to generate this.  Practical_Aliquoting
-<pre>
-#genome_ID    number_of_synteny_blocks   paleopolyploid_depth title_for_set
-	21	3	cacao
-#colorCode: means ancestral chromosome assignment -- better term is bin.  For eudicots, this is thought to be 7 (could be other numbers for other reconstructions)
-#subgenome: which subgenome to which a block belongs
-#chr start end:  position of block in extant genome
-colorCode	subgenome	chr	start	end
-	1	2	12716774	27462648
-	2	4	349021	14314443
-	3	3	208385	16091087
-	1	3	19982135	24212437
-	2	1	27207631	30674661
-	3	3	16741484	19970692
-	1	2	1350572	7237080
-	2	1	315357	7988483
-	3	8	43353	6712481
-	1	9	739576	3437504
-	2	6	1071819	9467758
-	3	9	3437504	9589803
-	1	4	18966341	23343107
-	2	1	21083375	26683534
-	3	5	23329957	25395907
-	1	9	23851693	28019603
-	2	5	541440	5362779
-	3	10	333882	12953021
-	1	6	10864133	14795052
-	2	1	8722499	15224371
-	3	7	511932	6542889
-	0	1	amborella
-colorCode	subgenome	chr	start	end
-</pre>
