MAKER Test
Revision as of 01:47, 7 March 2015
MAKER is a genome annotation pipeline[1]. It lets a researcher or group of researchers take a genome plus some amount of evidence (for example, an EST file and a protein file, both in FASTA format, a repeat file, and potentially more) and create structural annotations for that genome. It can also train HMM files to give better annotations for a genome with little evidence, although this takes many runs. This page documents the work being done to add MAKER to CoGe.
How to Download and Install MAKER
MAKER may be downloaded from the Yandell lab, here: http://www.yandell-lab.org/software/maker.html. The full installation instructions may be found here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial. The instructions here will just serve as a brief overview for getting MAKER running on the command line in UNIX.
1. Register and download MAKER from the Yandell lab MAKER software page.
2. Unpack MAKER in whichever folder it will be run from.
3. Download and install prerequisites if they are not installed. The minimum prerequisites are:
a. BioPerl and various other Perl modules (see the MAKER documentation for a complete list[2]).
b. SNAP
c. Exonerate
d. RepeatMasker
e. NCBI BLAST
4. Add MAKER and its prerequisites to $PATH. For example, the paths might look something like:
a. MAKER: export PATH="/home/user/maker/bin:$PATH"
b. RepeatMasker: export PATH="/home/user/RepeatMasker:$PATH"
c. Exonerate: export PATH="/home/user/exonerate-2.2.0-x86_64/bin:$PATH"
d. SNAP: export PATH="/home/user/snap:$PATH"
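Once the exports are in place, a quick check can confirm that the shell resolves each tool. The following is a minimal sketch, not part of MAKER itself: the helper name "missing_tools" is hypothetical, and the executable names (maker, snap, exonerate, RepeatMasker, blastn) are assumed to match a typical install and may differ on a given system.

```shell
#!/bin/sh
# Print each named executable that cannot be found on $PATH, one per line.
# "missing_tools" is a hypothetical helper, not something shipped with MAKER.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names assumed for a typical install; adjust to match yours.
missing=$(missing_tools maker snap exonerate RepeatMasker blastn)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    echo "All listed tools were found on PATH."
fi
```

Any name that is reported likely needs its directory added to $PATH in the same way as the examples above.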
Running MAKER from the UNIX command line
1. Set up the MAKER control files by typing "maker -CTL".
a. To set the genome, open "maker_opts.ctl" and enter the path to the genome after "genome=". So, this might look like "genome=dpp_contig.fasta". Leave a space between this and the commented description (which starts with a "#" symbol).
b. Set the EST or mRNA data by typing the path to the desired EST or mRNA FASTA file after "est=".
c. To have MAKER generate structural annotations directly from EST or mRNA data, change "est2genome=0" to "est2genome=1".
d. Change any other desired settings. For more details on these settings and how to set them, see the full MAKER tutorial[3].
2. Open the MAKER boot options file ("maker_bopts.ctl") and ensure that the correct BLAST search type is selected.
a. For example, to use NCBI-BLAST, set "blast_type=ncbi+".
3. Edit the MAKER options file ("maker_opts.ctl") to the desired settings.
4. Run MAKER by typing "maker" on the command line.
a. To prevent MAKER from eating up too many system resources, use the "nice" command.
b. To avoid being spammed with constant messages, redirect both stdout and stderr to a file with "&> somefile.txt". To also run MAKER in the background, end the command with "&".
c. Putting both "a" and "b" together, the command might look something like "nice maker &> log.txt &".
5. MAKER may take a day or more to run, depending on the size of the genome it is annotating and the evidence it is given. Once finished, it should place all files into a "genomename.maker.output" folder, with the data split into separate folders by contig. Each of these folders has several sub-folders, one of which should contain a "chromosomename.gff" file. This file holds the structural annotations and repeats, and always ends with the contig sequence.
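The control-file edits described in the steps above can also be scripted, which helps when the same settings are applied to many runs. The sketch below is one possible approach and is not part of MAKER: the helper "set_ctl_option" is hypothetical, the file name "lyrata_mrna.fasta" is a placeholder, and the demo file is a small stand-in with abbreviated comments rather than a real "maker_opts.ctl". It assumes each option sits on its own line as "key=value #comment".

```shell
#!/bin/sh
# set_ctl_option FILE KEY VALUE  (hypothetical helper, not part of MAKER)
# Rewrite the "KEY=..." line of a control file, keeping the trailing "#" comment
# and leaving a space between the value and the comment.
# Assumes VALUE contains no "|", "&", or backslash, which are special to sed here.
set_ctl_option() {
    ctl_file=$1 key=$2 value=$3
    tmp_file="${ctl_file}.tmp"
    sed "s|^${key}=[^#]*|${key}=${value} |" "$ctl_file" > "$tmp_file" &&
        mv "$tmp_file" "$ctl_file"
}

# Stand-in for the options file that "maker -CTL" generates (comments abbreviated).
demo_ctl=$(mktemp)
cat > "$demo_ctl" <<'EOF'
genome= #genome sequence
est= #set of ESTs or assembled mRNA-seq in fasta format
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
EOF

set_ctl_option "$demo_ctl" genome dpp_contig.fasta
set_ctl_option "$demo_ctl" est lyrata_mrna.fasta   # placeholder file name
set_ctl_option "$demo_ctl" est2genome 1
cat "$demo_ctl"
```

On a real run, the same calls would target "maker_opts.ctl" in the working directory instead of the stand-in file. Anchoring the pattern at "key=" keeps the "est" edit from touching the "est2genome" line.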
Testing MAKER annotation accuracy with minimal data
The point of this exercise is to see how little data can be used with MAKER and still get reasonably accurate annotations.
The Arabidopsis thaliana genome was annotated using MAKER with the Arabidopsis lyrata mRNA data as evidence. Both the genome and the mRNA annotations were downloaded from GenBank. The A. thaliana genome with the annotations from MAKER using the A. lyrata mRNA evidence was uploaded to CoGe (genome id 25440) and compared to the existing A. thaliana annotations (genome id 16911). The analysis can be recreated here: https://genomevolution.org/r/f6kd.
A fair amount of synteny exists between the two genomes, although noise is present. The actual gene regions are currently being compared to assess accuracy directly. The SynMap listed above is used, and then BLAST is set to "BlastN: Small Regions" in GEvo. Also, under "Sequence Options," "Mask Sequence" is set to "Non-CDS" for both sequences.
For the region: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT1G26850.1 (chr: 1 9251146-9353432) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|1-exonerate_est2genome-gene-93.12-mRNA-1 (chr: 1 9251146-9353432):
Out of 29 genes in the original, 10 appear to match closely in the MAKER-annotated sequence, 1 matches with some noticeable changes, and 2 are only partial matches; thus 16 appear to be missing. One pseudogene in the original is annotated as an actual gene in the MAKER-annotated sequence.
For the region: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT2G27460.1 (chr: 2 11690670-11794867) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|2-exonerate_est2genome-gene-117.3-mRNA-1 (chr: 2 11690670-11794867):
Out of 32 genes in the original, 13 appear to match relatively closely, 1 appears to be a very close match, 13 appear to be missing, and 5 appear to be partial matches.
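The "missing" figures for both regions follow from subtracting every matched category from the total. A small sketch of that arithmetic, using only the counts reported above (the helper name "unmatched_genes" is hypothetical):

```shell
#!/bin/sh
# unmatched_genes TOTAL COUNT...  (hypothetical helper)
# Subtract every matched-category count from the total number of reference genes.
unmatched_genes() {
    remaining=$1
    shift
    for count in "$@"; do
        remaining=$(( remaining - count ))
    done
    echo "$remaining"
}

# Chromosome 1 region: 29 genes; 10 close, 1 with noticeable changes, 2 partial.
echo "chr1 missing: $(unmatched_genes 29 10 1 2)"   # prints "chr1 missing: 16"
# Chromosome 2 region: 32 genes; 13 relatively close, 1 very close, 5 partial.
echo "chr2 missing: $(unmatched_genes 32 13 1 5)"   # prints "chr2 missing: 13"
```

Both results agree with the counts given in the text, so the categories for each region account for every gene in the original annotation.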
<UNDER CONSTRUCTION>