Ancestral Reconstruction Pipeline: Difference between revisions
Revision as of 01:19, 26 April 2014

Plan for refactoring
GetGenomes: remove config file, add option to specify output dir for output files
This program gets data ready for downstream processing
- -d <directory of input synmap files>
- -g gid1,gid2,gid3,gid4... <list of comma separated CoGe genome ids>
- -p p1,p2,p3,p4 <list of comma separated ploidy levels for genomes -- note these are paired ordered data with the -g option?>
- -s <subgenome_file>
- -o <output directory for SubGenomeInGeneOrder OrthologSets GenomeInString files>
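As an illustration of replacing the config file with command line options, here is a minimal sketch of the GetGenomes option parsing using Python's argparse. It assumes the -g and -p lists are paired by position (the plan above still marks this as an open question), that ploidy levels are integers, and that genome ids are kept as strings. The function and attribute names (comma_list, parse_args, ploidy_by_genome) are hypothetical, and the real program may be written in another language.

```python
import argparse


def comma_list(convert):
    """Build an argparse type that splits 'a,b,c' and converts each item."""
    def parse(text):
        return [convert(item) for item in text.split(",") if item]
    return parse


def build_parser():
    parser = argparse.ArgumentParser(prog="GetGenomes")
    parser.add_argument("-d", dest="synmap_dir", required=True,
                        help="directory of input synmap files")
    parser.add_argument("-g", dest="genome_ids", type=comma_list(str),
                        required=True, help="comma separated genome ids")
    parser.add_argument("-p", dest="ploidies", type=comma_list(int),
                        required=True, help="comma separated ploidy levels")
    parser.add_argument("-s", dest="subgenome_file", required=True,
                        help="subgenome file")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="output directory")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the i-th ploidy belongs to the i-th genome id.
    if len(args.genome_ids) != len(args.ploidies):
        parser.error("-g and -p must list the same number of values")
    args.ploidy_by_genome = dict(zip(args.genome_ids, args.ploidies))
    return args
```

Checking the two list lengths up front turns a silent mispairing into an immediate usage error. The same pattern would apply to the -g and -w pair used by the later programs.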
GetContigInput
This program gets data ready for ancestral ordering of genes by MWM
- Remove config file dependency
- -g gid1,gid2,gid3,gid4... <list of comma separated CoGe genome ids>
- -w w1,w2,w3,w4 <list of comma separated weights for genomes -- note these are paired ordered data with the -g option?>
- -wa <threshold minimum adjacency score for keeping a contig. Called 'weightOfAdjacent' in original config file>
- -i <input file: GenomesInString from program GetGenomes>
- -o <output file name>
Note: Output is now a tab delimited file with each line containing: vertex vertex weight. These data will be used by the MWMPython program below
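The output format described in the note can be sketched as follows. This is only an illustration of the "vertex vertex weight" layout: the function name write_edges and the example values are hypothetical, and the vertex ids are shown as integers as an assumption.

```python
def write_edges(edges, path):
    """Write one edge per line as vertex<TAB>vertex<TAB>weight."""
    with open(path, "w") as handle:
        for u, v, weight in edges:
            handle.write(f"{u}\t{v}\t{weight}\n")


# Hypothetical adjacencies between vertex ids, with illustrative weights.
example_edges = [(1, 2, 3.0), (2, 3, 1.5), (3, 4, 2.0)]
```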
MWMPython: http://jorisvr.nl/maximummatching.html needs command line options for input and output
This program is a general tool for Maximum Weight Matching. The first run is for ancestral gene ordering. The second run is for ancestral contig ordering.
- -i <input file or directory>
- File type is a set of lines, each containing: vertex vertex weight
- note: if directory, will batch process all files
- -o <outfile or directory> If no option is specified, the results go to STDOUT
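The -i and -o behaviour described above (single file or batch directory, with STDOUT as the default) can be sketched as below. This is a sketch under stated assumptions, not the MWMPython code itself: brute_force_matching is an exhaustive stand-in that is only practical for a handful of edges, and the implementation linked above would take its place through the match parameter. The output layout (one matched edge per line) and all function names are also assumptions.

```python
import os
import sys


def read_edges(path):
    """Parse 'vertex vertex weight' lines into (u, v, weight) tuples."""
    edges = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            u, v, weight = fields
            edges.append((u, v, float(weight)))
    return edges


def brute_force_matching(edges):
    """Exhaustive maximum weight matching; a stand-in for tiny inputs only."""
    best = (0.0, [])

    def search(index, used, weight, chosen):
        nonlocal best
        if index == len(edges):
            if weight > best[0]:
                best = (weight, list(chosen))
            return
        u, v, w = edges[index]
        # Option 1: take this edge if both endpoints are still unmatched.
        if u != v and u not in used and v not in used:
            chosen.append((u, v, w))
            search(index + 1, used | {u, v}, weight + w, chosen)
            chosen.pop()
        # Option 2: leave this edge out.
        search(index + 1, used, weight, chosen)

    search(0, frozenset(), 0.0, [])
    return best[1]


def write_matching(matching, out):
    for u, v, w in matching:
        out.write(f"{u}\t{v}\t{w}\n")


def run(input_path, output_path=None, match=brute_force_matching):
    """Match one file, or every file in a directory (batch mode)."""
    batch = os.path.isdir(input_path)
    if batch:
        names = sorted(os.listdir(input_path))
        paths = [os.path.join(input_path, n) for n in names]
        paths = [p for p in paths if os.path.isfile(p)]
        if output_path is not None:
            os.makedirs(output_path, exist_ok=True)
    else:
        paths = [input_path]
    for path in paths:
        matching = match(read_edges(path))
        if output_path is None:
            write_matching(matching, sys.stdout)  # no -o: results go to STDOUT
            continue
        # Batch mode writes one result file per input file, under the same name.
        target = (os.path.join(output_path, os.path.basename(path))
                  if batch else output_path)
        with open(target, "w") as out:
            write_matching(matching, out)
```

Passing the matching function as a parameter keeps the file handling independent of the algorithm, so the same wrapper could serve both the gene ordering run and the contig ordering run.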
GetContigOutputAndScaffoldInput
This program maps ancestral contigs back to the various genomes, gets their positions, and formats the data for a second MWM run that generates ancestrally ordered contigs.
Note: The goal is to get everything onto the command line. Currently, several of the files are hardcoded.
- -mml <threshold minimum mapping length. The minimum number of "genes" mapping to a subgenome. Called 'minimumGeneGroupLength' in original config file>
Note: genomeInContigIndex is specified in the config file. This is the weighting for each subgenome. These values will be removed and the weights for genomes used instead:
- -g gid1,gid2,gid3,gid4... <list of comma separated CoGe genome ids>
- -w w1,w2,w3,w4 <list of comma separated weights for genomes -- note these are paired ordered data with the -g option>
- -co <configoutput file generated by the MWMPython program>
- -s <subGenomesInGeneOrder file>
- -gf <genomeInString file>
- -o <output directory name>
Note: Multiple output files, one for each ancestral chromosome bin
Note: There are three sets of data that are needed for the subsequent steps:
- Directory named "scaffolds". These are tab delimited files with each line containing: vertex vertex weight. These data will be used by the MWMPython program
- Directory named "binfiles". These files are used in the final program ScaffoldOutput to assign the correct contigs to the bins
- File named "contig2genes.txt"
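Since the later steps depend on all three of these outputs, a small check before continuing can catch an incomplete run. The sketch below assumes that the two directories and the file sit directly inside the directory given with -o, which the notes above do not state. The function name missing_outputs is hypothetical.

```python
import os


def missing_outputs(output_dir):
    """Return the names of expected outputs that are absent from output_dir."""
    expected = [
        ("scaffolds", os.path.isdir),        # input for the second MWM run
        ("binfiles", os.path.isdir),         # used by ScaffoldOutput
        ("contig2genes.txt", os.path.isfile),
    ]
    return [name for name, exists in expected
            if not exists(os.path.join(output_dir, name))]
```

An empty list means all three outputs are present. Anything else names what still has to be produced before running MWMPython and ScaffoldOutput.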
MWMPython
Reuse of the same program listed above for ancestral ordering of contigs
ScaffoldOutput
Putting all the output data back together for the final ancestral genome
- -im <directory of MWM files>
- -ib <directory of bin composition files>
- -cg <file containing the lookup of gene families comprising contigs>
- -o <output file of reconstructed genome>