Ancestral Reconstruction Pipeline: Difference between revisions


Revision as of 18:48, 25 April 2014

Notes on pipeline details

@@ Line 1: / Line 1: @@
-This page is to document the Ancestral Reconstruction Pipeline by Chunfang Zheng
+[[Notes on pipeline details]]
-Master control is from her batch script: batchFile.txt
-<br>
-<pre>#compile
-#gets gene pairs from SynMap output
-javac TestGetGenomes.java
-#run with config file
-#config file: number of genomes and syntenic depth relationships
-java TestGetGenomes data/inputInfoCoGe.txt
-#outputs from above:
-#               orthologSets_8400_9050_10997_19515.txt
-# This file contains orthologous sets of genes across genomes
-#               genomeInString_8400_9050_10997_19515.txt
-#               subgenomeRangesInGeneOrder_8400_9050_10997_19515.txt
-#This program prepares the input data for the MWM step, which is run by a Python program
-javac TestGetContigInput.java
-java TestGetContigInput data/inputInfoAGRP.txt
-#Generates the first MWM iteration
-cd outputFiles
-python contigInput_8400_9050_10997_19515.py&gt; contigOutput.txt
-#input: the weighted vertices
-#output: ancestral contigs, listed by ortho_gene_family_ids.  The second MWM is run on these contigs
-cd ..
-javac TestGetContigOutputAndScaffoldInput.java
-java TestGetContigOutputAndScaffoldInput data/inputInfoAGRP.txt
-#generates several Python scripts (identical to the Python script above) for running MWM, this time on ancestral contigs mapped to extant genomes.  Their output provides information on how to link adjacent contigs into ancestral scaffolds.
-cd outputFiles
-python scaffoldInput1.py &gt; scaffoldOutput1.txt
-python scaffoldInput2.py &gt; scaffoldOutput2.txt
-python scaffoldInput3.py &gt; scaffoldOutput3.txt
-python scaffoldInput4.py &gt; scaffoldOutput4.txt
-python scaffoldInput5.py &gt; scaffoldOutput5.txt
-python scaffoldInput6.py &gt; scaffoldOutput6.txt
-python scaffoldInput7.py &gt; scaffoldOutput7.txt
-#joins the outputs of the MWM runs above into the final ancestral chromosomes
-cd ..
-javac TestScaffoldOutput.java
-java TestScaffoldOutput
-</pre>
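The seven scaffold runs in the batch script follow one naming pattern, so they can be generated rather than typed out. The following is a minimal sketch, not part of the original pipeline: the function names `scaffold_commands` and `run_scaffold_scripts` are hypothetical, and it assumes the scripts sit in `outputFiles` and are run with `python`, as in the batch script.

```python
import subprocess
from typing import List, Tuple


def scaffold_commands(count: int) -> List[Tuple[List[str], str]]:
    """Build (argv, output file name) pairs for scaffoldInput1.py .. scaffoldInput<count>.py."""
    return [
        (["python", f"scaffoldInput{i}.py"], f"scaffoldOutput{i}.txt")
        for i in range(1, count + 1)
    ]


def run_scaffold_scripts(count: int, workdir: str = "outputFiles") -> None:
    """Run each scaffold script in workdir, redirecting stdout to its output file."""
    for argv, out_name in scaffold_commands(count):
        with open(f"{workdir}/{out_name}", "w") as out:
            # check=True stops the loop if one of the MWM runs fails
            subprocess.run(argv, cwd=workdir, stdout=out, check=True)
```

Calling `run_scaffold_scripts(7)` would reproduce the seven `python scaffoldInputN.py > scaffoldOutputN.txt` lines of the batch script.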
-inputInfo example file (describes input from CoGe)
-<pre>
-#obvious
-numberOfGenomes	4
-numberOfGenomePairs	9
-#synmap output with correct syntenic depth
-	9050	data/8400_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	8400	data/10997_8400.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	9050	data/10997_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	19515	data/10997_19515.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.1.40.gcoords
-	8400	data/19515_8400.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac1.3.40.gcoords
-	9050	data/19515_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac1.3.40.gcoords
-	8400	data/8400_8400.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	9050	data/9050_9050.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-	10997	data/10997_10997.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
-#syntenic depth among the genomes
-#peach
-	3
-#grape
-	3
-#cacao
-	3
-#amborella
-	1
-#subgenome information
-data/subGenomeRegions.txt
-</pre>
-==orthologousSets.txt==
-<pre>
-#ortholog_set org_1_paralog_1|org_1_paralog_2|org_1_paralog_3 org_2_paralog_1|etc.
-#the maximum number of paralogous genes in an organism is based on the syntenic depth for that organism.  E.g., three for peach
-	PAC:17653852|PAC:17650319|PAC:17656076	GSVIVG01035853001|GSVIVG01017316001|GSVIVG01015543001	Tc06_g001640|Tc09_g002300|Tc09_g011400	evm_27.model.AmTr_v1.0_scaffold00078.25
-	PAC:17659187|PAC:17667521|PAC:17645335	GSVIVG01024935001|GSVIVG01034155001|GSVIVG01016522001	Tc05_g004430|Tc09_g031500|Tc10_g002720	evm_27.model.AmTr_v1.0_scaffold00012.209
-	PAC:17660832|PAC:17641120|PAC:17649248	GSVIVG01035855001|GSVIVG01017319001|GSVIVG01015546001	Tc06_g001660|Tc09_g002290|Tc09_g011430	evm_27.model.AmTr_v1.0_scaffold00078.24
-....
-	PAC:17654291	missing	Tc02_g001110	missing
-</pre>
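A line of this file can be split into per-genome paralog lists as follows. This is a minimal sketch, not code from the pipeline: the function name `parse_ortholog_line` is hypothetical, and it assumes tab-separated fields, a first field holding the ortholog set ID (empty in the sample rows above), `|` between paralogs, and the literal word `missing` for a genome with no member.

```python
from typing import Dict, List, Tuple

MISSING = "missing"  # placeholder used when a genome has no gene in the set


def parse_ortholog_line(
    line: str, genome_names: List[str]
) -> Tuple[str, Dict[str, List[str]]]:
    """Return (set ID, {genome name: list of paralog gene IDs}) for one line."""
    set_id, *fields = line.rstrip("\n").split("\t")
    if len(fields) != len(genome_names):
        raise ValueError(
            f"expected {len(genome_names)} genome columns, found {len(fields)}"
        )
    members = {
        name: [] if field == MISSING else field.split("|")
        for name, field in zip(genome_names, fields)
    }
    return set_id, members


# Genome order as listed in the syntenic depth section of the inputInfo file
genomes = ["peach", "grape", "cacao", "amborella"]
row = "\tPAC:17654291\tmissing\tTc02_g001110\tmissing"
set_id, members = parse_ortholog_line(row, genomes)
```

For the sample row, `members["peach"]` holds one gene ID and `members["grape"]` is an empty list.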
-===SubGenomeRegions.txt===
-This file contains information about the subgenomes (parental genomes) making up an extant genome.  Chunfang often creates these files by hand, but also has a program to generate them: Practical_Aliquoting
-<pre>
-#genome_ID    number_of_synteny_blocks   paleopolyploid_depth title_for_set
-	21	3	cacao
-#colorCode: the ancestral chromosome assignment (a better term is bin).  For eudicots, the number of bins is thought to be 7 (it could be a different number for other reconstructions)
-#subgenome: the subgenome to which a block belongs
-#chr start end:  position of block in extant genome
-colorCode	subgenome	chr	start	end
-	1	2	12716774	27462648
-	2	4	349021	14314443
-	3	3	208385	16091087
-	1	3	19982135	24212437
-	2	1	27207631	30674661
-	3	3	16741484	19970692
-	1	2	1350572	7237080
-	2	1	315357	7988483
-	3	8	43353	6712481
-	1	9	739576	3437504
-	2	6	1071819	9467758
-	3	9	3437504	9589803
-	1	4	18966341	23343107
-	2	1	21083375	26683534
-	3	5	23329957	25395907
-	1	9	23851693	28019603
-	2	5	541440	5362779
-	3	10	333882	12953021
-	1	6	10864133	14795052
-	2	1	8722499	15224371
-	3	7	511932	6542889
-	0	1	amborella
-colorCode	subgenome	chr	start	end
-</pre>
-===inputInfoAGRP.txt===
-<pre>
-####
-#MWM first pass
-####
-numberOfGenomes	4
-#weight for adjacencies based
-genomeIndex	ploidyNumber	weights
-	3	0	2	4	6
-	3	0	2	4	6
-	3	0	2	4	6
-	1	0	3
-#minimum threshold -- each gene family needs at least two adjacencies in this case
-weightOfAdjacent>=	4
-#the output of the above is the .py file: it takes the Python maximum weight matching program (http://jorisvr.nl/maximummatching.html) and adds in the data generated from the input and the parameters above.  The program could be refactored so that these data are read from an external file.
-# Next, the Python script is run (e.g., python contigInput_8400_9050_10997_19515.py> contigOutput.txt)
-#output looks like contigOutput.txt
-#Three-column output:  vertex - vertex - weight
-# The two lines at the end give the total weight and the run time (seconds).  These benchmarking lines should be labeled and commented out.
-###
-#MWM iteration 2
-###
-#For the first reconstruction: the minimum number of orth_gene_families that must be on the same chromosome and in the same ancestral chromosome (bin)
-minimumGeneGroupLength>=	3
-AncChrNumber	7
-genomeInContigIndex	weight
-	2
-	2
-	2
-	2
-	2
-	2
-	2
-	2
-	2
-	3
-</pre>
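The MWM step above can be illustrated on a toy graph. The sketch below is not the linked matching program and does not call it: it is a brute-force stand-in that only works for a handful of edges. The names `filter_edges` and `brute_force_mwm` and the edge list are hypothetical, and dropping edges below the threshold before matching is an assumption about how `weightOfAdjacent>=` is applied.

```python
from itertools import combinations
from typing import List, Tuple

Edge = Tuple[int, int, int]  # (vertex, vertex, weight), like the three-column output


def filter_edges(edges: List[Edge], min_weight: int) -> List[Edge]:
    """Keep only edges whose weight reaches the threshold."""
    return [e for e in edges if e[2] >= min_weight]


def brute_force_mwm(edges: List[Edge]) -> Tuple[int, List[Edge]]:
    """Return (total weight, matching) by trying every subset of edges.

    A matching is a set of edges in which no vertex appears twice.
    Exponential time: for illustration on tiny inputs only.
    """
    best_weight, best_matching = 0, []
    for size in range(1, len(edges) + 1):
        for subset in combinations(edges, size):
            vertices = [v for a, b, _ in subset for v in (a, b)]
            if len(vertices) != len(set(vertices)):
                continue  # some vertex is used by two edges
            weight = sum(w for _, _, w in subset)
            if weight > best_weight:
                best_weight, best_matching = weight, list(subset)
    return best_weight, best_matching


# Hypothetical adjacency graph; the threshold mirrors weightOfAdjacent>= 4
toy_edges = [(0, 1, 6), (1, 2, 8), (2, 3, 6), (0, 3, 2)]
kept = filter_edges(toy_edges, 4)
total, matching = brute_force_mwm(kept)
for a, b, w in matching:
    print(a, b, w)
print(total)
```

Here the heaviest single edge (1, 2, 8) is not chosen, because the two edges (0, 1, 6) and (2, 3, 6) together weigh more. For real inputs a polynomial-time matching implementation, such as the one the pipeline uses, is needed.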
