Ancestral Reconstruction Pipeline

From CoGepedia
Revision as of 23:10, 24 April 2014 by Elyons (talk | contribs)

This page documents the Ancestral Reconstruction Pipeline by Chunfang Zheng.

The pipeline is controlled by her batch script, batchFile.txt:


#compile 
#gets gene pairs from SynMap output
javac TestGetGenomes.java

#run with config file
#config file: number of genomes and syntenic depth relationships

java TestGetGenomes data/inputInfoCoGe.txt

#outputs from above:
#               orthologSets_8400_9050_10997_19515.txt
# This file contains orthologous sets of genes across the genomes

#               genomeInString_8400_9050_10997_19515.txt
#               subgenomeRangesInGeneOrder_8400_9050_10997_19515.txt

#This program prepares the data for the MWM (maximum weight matching) step, which is run by a Python program
javac TestGetContigInput.java
java TestGetContigInput data/inputInfoAGRP.txt


#Generates the first MWM iteration
cd outputFiles
python contigInput_8400_9050_10997_19515.py > contigOutput.txt


#Input is the weighted vertices
#Output is the ancestral contigs, given as ortho_gene_family_ids. The second MWM is run on these contigs
cd ..
javac TestGetContigOutputAndScaffoldInput.java
java TestGetContigOutputAndScaffoldInput data/inputInfoAGRP.txt


cd outputFiles
python scaffoldInput1.py > scaffoldOutput1.txt
python scaffoldInput2.py > scaffoldOutput2.txt
python scaffoldInput3.py > scaffoldOutput3.txt
python scaffoldInput4.py > scaffoldOutput4.txt
python scaffoldInput5.py > scaffoldOutput5.txt
python scaffoldInput6.py > scaffoldOutput6.txt
python scaffoldInput7.py > scaffoldOutput7.txt
cd ..
javac TestScaffoldOutput.java
java TestScaffoldOutput
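
The batch script above can also be written as a short Python driver, which replaces the seven near-identical scaffold lines with a loop. This is a minimal sketch and not part of the original pipeline: the names build_steps and run_steps and the (working directory, command, output file) layout are assumptions made for illustration, and the count of seven scaffold scripts is taken from this example run.

```python
import subprocess


def build_steps(genome_ids, n_scaffolds=7):
    """Return the pipeline steps as (cwd, argv, stdout_path) tuples.

    Hypothetical helper: it only mirrors the commands in batchFile.txt.
    """
    tag = "_".join(str(g) for g in genome_ids)
    steps = [
        (".", ["javac", "TestGetGenomes.java"], None),
        (".", ["java", "TestGetGenomes", "data/inputInfoCoGe.txt"], None),
        (".", ["javac", "TestGetContigInput.java"], None),
        (".", ["java", "TestGetContigInput", "data/inputInfoAGRP.txt"], None),
        ("outputFiles", ["python", f"contigInput_{tag}.py"], "contigOutput.txt"),
        (".", ["javac", "TestGetContigOutputAndScaffoldInput.java"], None),
        (".", ["java", "TestGetContigOutputAndScaffoldInput",
               "data/inputInfoAGRP.txt"], None),
    ]
    # One scaffold script per index, each writing its own output file.
    for i in range(1, n_scaffolds + 1):
        steps.append(("outputFiles", ["python", f"scaffoldInput{i}.py"],
                      f"scaffoldOutput{i}.txt"))
    steps += [
        (".", ["javac", "TestScaffoldOutput.java"], None),
        (".", ["java", "TestScaffoldOutput"], None),
    ]
    return steps


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for cwd, argv, out_path in steps:
        if out_path is None:
            subprocess.run(argv, cwd=cwd, check=True)
        else:
            # Redirect stdout to a file, as the shell '>' does.
            with open(f"{cwd}/{out_path}", "w") as out:
                subprocess.run(argv, cwd=cwd, stdout=out, check=True)
```

For the example run on this page, the call would be run_steps(build_steps([8400, 9050, 10997, 19515])).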


inputInfoCoGe.txt (example file; describes the input from CoGe)

#obvious
numberOfGenomes	4
numberOfGenomePairs	9

#SynMap output with the correct syntenic depth
8400	9050	data/8400_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
10997	8400	data/10997_8400.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
10997	9050	data/10997_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
10997	19515	data/10997_19515.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.1.40.gcoords
19515	8400	data/19515_8400.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac1.3.40.gcoords
19515	9050	data/19515_9050.CDS-CDS.last.tdd10.cs0.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac1.3.40.gcoords
8400	8400	data/8400_8400.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
9050	9050	data/9050_9050.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords
10997	10997	data/10997_10997.CDS-CDS.last.tdd10.filtered.dag.all.go_D20_g10_A5.aligncoords.Dm0.ma1.qac3.3.40.gcoords

#syntenic depth among the genomes
#peach
8400	3
#grape
9050	3
#cacao
10997	3
#amborella
19515	1

#subgenome information
data/subGenomeRegions.txt

orthologousSets.txt


#ortholog_set org_1_paralog_1|org_1_paralog_2|org_1_paralog_3 org_2_paralog_1|etc.
#max number of paralog genes in an organism is based on the syntenic depth for that organism.  E.g., three for peach
1	PAC:17653852|PAC:17650319|PAC:17656076	GSVIVG01035853001|GSVIVG01017316001|GSVIVG01015543001	Tc06_g001640|Tc09_g002300|Tc09_g011400	evm_27.model.AmTr_v1.0_scaffold00078.25	
2	PAC:17659187|PAC:17667521|PAC:17645335	GSVIVG01024935001|GSVIVG01034155001|GSVIVG01016522001	Tc05_g004430|Tc09_g031500|Tc10_g002720	evm_27.model.AmTr_v1.0_scaffold00012.209	
3	PAC:17660832|PAC:17641120|PAC:17649248	GSVIVG01035855001|GSVIVG01017319001|GSVIVG01015546001	Tc06_g001660|Tc09_g002290|Tc09_g011430	evm_27.model.AmTr_v1.0_scaffold00078.24	
....
10483	PAC:17654291	missing	Tc02_g001110	missing	
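
The line format above can be read with a few lines of Python. This is a minimal sketch, assuming the columns are tab-separated as shown and that the literal word "missing" means the genome has no gene in that set. The function name parse_ortholog_line is hypothetical and is not part of the pipeline.

```python
def parse_ortholog_line(line):
    """Split one line into (set_id, [genes_for_genome_1, genes_for_genome_2, ...]).

    Columns are tab-separated; paralogs within one genome are separated
    by '|'. The literal 'missing' is read as an empty gene list.
    """
    fields = line.rstrip().split("\t")
    set_id = int(fields[0])
    genomes = [[] if f == "missing" else f.split("|") for f in fields[1:]]
    return set_id, genomes


def within_depth(genomes, depths):
    """Check that no genome has more paralogs than its syntenic depth."""
    return all(len(genes) <= d for genes, d in zip(genomes, depths))
```

With the syntenic depths from the example config (3, 3, 3, 1), within_depth(genomes, [3, 3, 3, 1]) tests the rule that the number of paralogs per organism is bounded by its depth.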

subGenomeRegions.txt

This file contains information about the subgenomes (parental genomes) that make up an extant genome. Chunfang often creates these files by hand, but also has a program to generate them: Practical_Aliquoting

#genome_ID    number_of_synteny_blocks   paleopolyploid_depth title_for_set
10997	21	3	cacao		
#colorCode: the ancestral chromosome assignment (a better term is bin). For eudicots, the number of ancestral chromosomes is thought to be 7 (other reconstructions may use other numbers)
#subgenome: the subgenome to which a block belongs
#chr start end:  position of block in extant genome
colorCode	subgenome	chr	start	end
1	1	2	12716774	27462648
1	2	4	349021	14314443
1	3	3	208385	16091087
2	1	3	19982135	24212437
2	2	1	27207631	30674661
2	3	3	16741484	19970692
3	1	2	1350572	7237080
3	2	1	315357	7988483
3	3	8	43353	6712481
4	1	9	739576	3437504
4	2	6	1071819	9467758
4	3	9	3437504	9589803
5	1	4	18966341	23343107
5	2	1	21083375	26683534
5	3	5	23329957	25395907
6	1	9	23851693	28019603
6	2	5	541440	5362779
6	3	10	333882	12953021
7	1	6	10864133	14795052
7	2	1	8722499	15224371
7	3	7	511932	6542889
19515	0	1	amborella
colorCode	subgenome	chr	start	end
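
The block layout above (one genome header line, one column header line, then number_of_synteny_blocks rows) can be parsed as follows. This is a minimal sketch, assuming whitespace-separated fields and single-word titles. The function name parse_subgenome_regions and the dictionary keys are hypothetical and do not come from the pipeline.

```python
def parse_subgenome_regions(lines):
    """Parse subGenomeRegions.txt-style lines into a dict keyed by genome ID."""
    # Drop blank lines and '#' comment lines, then split on whitespace.
    rows = [l.split() for l in lines
            if l.strip() and not l.lstrip().startswith("#")]
    genomes = {}
    i = 0
    while i < len(rows):
        genome_id, n_blocks, depth, title = rows[i][:4]
        n_blocks = int(n_blocks)
        # rows[i + 1] is the 'colorCode subgenome chr start end' header line.
        blocks = []
        for color, sub, chrom, start, end in rows[i + 2:i + 2 + n_blocks]:
            blocks.append({
                "bin": int(color),        # ancestral chromosome (colorCode)
                "subgenome": int(sub),
                "chr": chrom,
                "start": int(start),
                "end": int(end),
            })
        genomes[genome_id] = {"depth": int(depth), "title": title,
                              "blocks": blocks}
        i += 2 + n_blocks
    return genomes
```

A genome with zero blocks, such as the amborella entry, yields an empty block list.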

inputInfoAGRP.txt


####
#MWM first pass
####

numberOfGenomes	4

#adjacency weights for each genome
genomeIndex	ploidyNumber	weights
8400	3	0	2	4	6	
9050	3	0	2	4	6
10997	3	0	2	4	6
19515	1	0	3

#minimum threshold -- each gene family needs at least two adjacencies in this case
weightOfAdjacent>=	4

#The parameters above are used to generate the .py file. This file takes the Python program at http://jorisvr.nl/maximummatching.html and adds the data generated from the input and the parameters above. The program could be refactored so that these data are read from an external file.
# Next, the Python script is run (e.g., python contigInput_8400_9050_10997_19515.py > contigOutput.txt)
#The output looks like contigOutput.txt
#Three-column output: vertex - vertex - weight
# The two lines at the end give the total weight and the run time (seconds). These benchmarking lines should be labeled and commented out.
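
For readers new to maximum weight matching: given weighted edges, it selects a set of edges in which no vertex appears twice and whose total weight is as large as possible. The sketch below illustrates the idea by exhaustive search on a tiny graph. It is not the program from jorisvr.nl that the pipeline uses, it takes exponential time, and the function name max_weight_matching is chosen here only for illustration.

```python
def max_weight_matching(edges):
    """Exhaustive maximum weight matching for a small graph.

    edges: list of (u, v, weight) tuples.
    Returns (total_weight, chosen_edges). Illustration only.
    """
    def best(i, used):
        if i == len(edges):
            return 0, []
        # Option 1: skip edge i.
        skip_w, skip_m = best(i + 1, used)
        u, v, w = edges[i]
        if u in used or v in used or u == v:
            return skip_w, skip_m
        # Option 2: take edge i, which blocks both of its endpoints.
        take_w, take_m = best(i + 1, used | {u, v})
        take_w += w
        if take_w > skip_w:
            return take_w, [edges[i]] + take_m
        return skip_w, skip_m

    return best(0, frozenset())


total, matching = max_weight_matching([(0, 1, 6), (1, 2, 10), (2, 3, 6)])
for u, v, w in matching:
    print(u, v, w)      # three columns: vertex, vertex, weight
print(total)
```

In this example the heaviest single edge (1, 2, 10) is not chosen, because the two lighter edges (0, 1, 6) and (2, 3, 6) together weigh 12.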


###
#MWM iteration 2
###

#For the first reconstruction, the minimum number of orth_gene_families that must be on the same chromosome and in the same ancestral chromosome (bin)
minimumGeneGroupLength>=	3



AncChrNumber	7
genomeInContigIndex	weight
0	2
1	2
2	2
3	2
4	2
5	2
6	2
7	2
8	2
9	3
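
The settings in inputInfoAGRP.txt can be read into Python with a small parser. This is a minimal sketch based only on the example shown on this page: the function name parse_agrp and the returned dictionary layout are hypothetical, and the Java programs may read the file differently. It assumes whitespace-separated fields, '#' comment lines, and that rows starting with a number belong to one of the two tables.

```python
def parse_agrp(lines):
    """Parse inputInfoAGRP.txt-style lines into a dict of settings."""
    params, adjacency_weights, contig_weights = {}, {}, {}
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # blank line or comment
        if tokens[0].isdigit():
            if len(tokens) >= 3:
                # First-pass table: genomeIndex, ploidyNumber, weights...
                adjacency_weights[tokens[0]] = {
                    "ploidy": int(tokens[1]),
                    "weights": [int(t) for t in tokens[2:]],
                }
            elif len(tokens) == 2:
                # Second-pass table: genomeInContigIndex, weight
                contig_weights[int(tokens[0])] = int(tokens[1])
        elif len(tokens) == 2 and tokens[1].isdigit():
            # Key/value setting such as 'weightOfAdjacent>= 4'
            params[tokens[0].rstrip(">=")] = int(tokens[1])
        # Anything else is a column header line and is skipped.
    return {"params": params,
            "adjacency_weights": adjacency_weights,
            "contig_weights": contig_weights}
```

In the example file, each genome lists ploidyNumber + 1 adjacency weights (four for a ploidy of 3, two for a ploidy of 1). Whether the pipeline requires this is not stated on this page.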