MAKER Test


MAKER is a genome annotation pipeline[1]. It lets a researcher or group of researchers take a genome and supporting evidence (for example, an EST file and a protein file, both in FASTA format, plus a repeat file and potentially more) and create structural annotations for that genome. It can also train HMM files to improve annotations for a genome with little evidence, although this takes many runs. This page documents the work being done to add MAKER to CoGe.

How to Download and Install MAKER

MAKER may be downloaded from the Yandell lab here: http://www.yandell-lab.org/software/maker.html. The full installation instructions may be found here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial. The instructions on this page serve only as a brief overview of getting MAKER running on the UNIX command line.

1. Register and download MAKER from the Yandell lab MAKER software page.

2. Unpack MAKER in whichever folder it will be run from.

3. Download and install prerequisites if they are not installed. The minimum prerequisites are:

    a. BioPerl and various other Perl modules (see the MAKER documentation for a complete list[2]).
    b. SNAP
    c. Exonerate
    d. RepeatMasker
    e. NCBI BLAST

4. Add MAKER and its prerequisites to $PATH. For example, the paths might look something like:

    a. MAKER: export PATH="/home/user/maker/bin:$PATH"
    b. RepeatMasker: export PATH="/home/user/RepeatMasker:$PATH"
    c. Exonerate: export PATH="/home/user/exonerate-2.2.0-x86_64/bin:$PATH"
    d. SNAP: export PATH="/home/user/snap:$PATH"
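
Once the paths are set, it can help to confirm that each program is actually visible to the shell. The following is a minimal sketch, not part of the MAKER documentation: the helper name check_tools is made up for illustration, and the executable names (maker, RepeatMasker, exonerate, snap, and blastn/makeblastdb from NCBI BLAST+) are assumed to match a typical installation.

```shell
# Report which of the given executables can be found on $PATH.
# Returns 0 if all were found, 1 if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions for a typical install; adjust as needed.
check_tools maker RepeatMasker exonerate snap blastn makeblastdb ||
    echo "Some tools are missing; fix \$PATH before running MAKER."
```

Any tool reported as missing usually means the matching "export PATH=..." line points at the wrong folder or was not added to the shell startup file.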


Running MAKER from the UNIX command line

1. Set up the MAKER control files by typing "maker -CTL". This writes the control files (including "maker_opts.ctl" and "maker_bopts.ctl") to the current directory. The settings below are in "maker_opts.ctl".

    a. To set the genome sequence, enter the path to the genome FASTA file after "genome=", for example "genome=dpp_contig.fasta".
       Leave a space between the value and the commented description that follows it (which starts with a "#" symbol).
    b. Set the EST or mRNA data by entering the path to the desired EST or mRNA FASTA file after "est=".
    c. To have MAKER generate structural annotations directly from EST or mRNA data, change "est2genome=0" to "est2genome=1".
    d. Change any other desired settings. For more details on these settings and how to set them, see the full MAKER tutorial[3].

2. Open "maker_bopts.ctl" (the control file for BLAST and Exonerate settings) and ensure that the correct BLAST search type is selected.

    a. For example, to use NCBI BLAST+, set "blast_type=ncbi+".

3. Edit the MAKER options file ("maker_opts.ctl") to the desired settings.

4. Run MAKER by typing "maker" on the command line.

    a. To keep MAKER from crowding out other processes, lower its CPU priority with the "nice" command.
    b. To avoid being spammed with constant messages, redirect both stdout and stderr to a file with "&> somefile.txt". To also run MAKER in the background, append "&" to the command.
    c. Putting both "a" and "b" together, the command might look something like "nice maker &> log.txt &".

5. MAKER may take a day or more to run, depending on the size of the genome being annotated and the evidence provided. Once finished, it places all output in a "genomename.maker.output" folder, with the data split into separate folders by contig. Each contig folder has several sub-folders, one of which should contain a "chromosomename.gff" file. This file holds the structural annotations and repeats, and always ends with the contig sequence.
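
The steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the helper names (set_ctl_opt, run_in_background, count_genes) and the file name "my_est.fasta" are hypothetical, each control-file line is assumed to have the form "key=value #comment", and the embedded contig sequence is assumed to follow a "##FASTA" line as in standard GFF3.

```shell
# Set KEY=VALUE in a MAKER control file, keeping the trailing "#comment".
# Assumes VALUE contains no "|", "&", or backslash (all special to sed here).
set_ctl_opt() {
    key=$1
    value=$2
    file=$3
    sed -E -i.bak "s|^(${key}=)[^#]*|\1${value} |" "$file"
}

# Run a command at reduced CPU priority in the background, sending stdout
# and stderr to LOG_FILE. The process ID is left in BG_PID.
run_in_background() {
    log_file=$1
    shift
    nohup nice "$@" > "$log_file" 2>&1 &
    BG_PID=$!
}

# Count "gene" features in a GFF3 file, skipping comment lines and stopping
# at the "##FASTA" directive that introduces the embedded sequence.
count_genes() {
    awk -F '\t' '
        /^##FASTA/   { exit }
        /^#/         { next }
        $3 == "gene" { n++ }
        END          { print n + 0 }
    ' "$1"
}

# Example session; only runs where MAKER is installed and "maker -CTL"
# has already created the control files in the current directory.
if command -v maker >/dev/null 2>&1 && [ -f maker_opts.ctl ]; then
    set_ctl_opt genome dpp_contig.fasta maker_opts.ctl
    set_ctl_opt est my_est.fasta maker_opts.ctl      # hypothetical file name
    set_ctl_opt est2genome 1 maker_opts.ctl
    set_ctl_opt blast_type ncbi+ maker_bopts.ctl
    run_in_background log.txt maker
    echo "MAKER started (PID $BG_PID); follow it with: tail -f log.txt"
fi

# After the run has finished, print the gene count for each GFF file.
# Placeholder folder name from the text above; substitute your own.
out_dir="genomename.maker.output"
if [ -d "$out_dir" ]; then
    find "$out_dir" -name '*.gff' | while IFS= read -r gff; do
        printf '%s\t%s\n' "$gff" "$(count_genes "$gff")"
    done
fi
```

The sketch uses "> file 2>&1" rather than "&>" because the former also works in shells other than bash. sed keeps a ".bak" copy of each control file it edits, so a bad substitution can be undone.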

Testing MAKER annotation accuracy with minimal data

The point of this exercise is to see how little data can be used with MAKER and still get reasonably accurate annotations.

The Arabidopsis thaliana genome was annotated using MAKER with the Arabidopsis lyrata mRNA data as evidence. Both the genome and the mRNA annotations were downloaded from GenBank. The A. thaliana genome with the annotations from MAKER using the A. lyrata mRNA evidence was uploaded to CoGe (genome id 25440) and compared to the existing A. thaliana annotations (genome id 16911). The analysis can be recreated here: https://genomevolution.org/r/f6kd.

A fair amount of synteny exists between the two genomes, although noise is present. The actual gene regions are currently being compared to assess accuracy directly. Starting from the SynMap analysis linked above, each region is opened in GEvo with BLAST set to "BlastN: Small Regions" and, under "Sequence Options," "Mask Sequence" set to "Non-CDS" for both sequences.


For the region: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT1G26850.1 (chr: 1 9251146-9353432) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|1-exonerate_est2genome-gene-93.12-mRNA-1 (chr: 1 9251146-9353432):

Out of 29 genes in the original, 10 appear to match closely in the MAKER-annotated sequence, 1 matches with some noticeable changes, and 2 are only partial matches, leaving 16 that appear to be missing. One pseudogene in the original is annotated as an actual gene in the MAKER-annotated sequence.

For the region: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT2G27460.1 (chr: 2 11690670-11794867) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|2-exonerate_est2genome-gene-117.3-mRNA-1 (chr: 2 11690670-11794867):

Out of 32 genes in the original, 13 appear to match relatively closely, 1 appears to be a very close match, 13 appear to be missing, and 5 appear to be partial matches.

For the region: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT1G17540.1 (chr: 1 5979551-6082641) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|1-exonerate_est2genome-gene-57.6-mRNA-1 (chr: 1 5684234-5787298):

Out of 27 genes total, none appeared to match closely, and only 3 had partial matches. 8 genes appear to be missing entirely, one pseudogene has been annotated as a functional gene, and another gene has been added.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT1G41830.1 (chr: 1 15553892-15657802) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|1-exonerate_est2genome-gene-156.0-mRNA-1 (chr: 1 15553892-15657802):

Out of 4 genes in the original, 2 appear to be close matches and 2 appear to be missing. None of the 13 pseudogenes in the original were annotated in the MAKER-annotated genome.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT5G37060.1 (chr: 5 14592741-14695414) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|5-exonerate_est2genome-gene-146.9-mRNA-1 (chr: 5 14592741-14695414):

Out of 15 genes in the original, 9 appear to be close matches, 2 appear to be partial matches, and 4 appear to be missing.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT2G25050.1 (chr: 2 10604108-10709383) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|2-exonerate_est2genome-gene-106.9-mRNA-1 (chr: 2 10604108-10709383):

Out of 19 genes in the original, 2 appear to be close or nearly exact matches, 9 appear to be relatively close matches, 3 appear to be partial matches, and 5 appear to be missing. 1 pseudogene has been annotated as a gene and 1 gene has been split into two genes.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT3G56140.1 (chr: 3 20779407-20882669) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|2-exonerate_est2genome-gene-168.5-mRNA-1 (chr: 2 16819379-16922569):

Out of 31 genes in the original, 4 appear to be close or relatively close matches, 6 appear to have partial matches, and the remaining 21 have no real match. That said, there are 17 genes total in the MAKER-annotated genome, 7 of which do not appear to have any matching annotations.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT1G51820.1 (chr: 1 19187407-19291883) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|2-exonerate_est2genome-gene-124.1-mRNA-1 (chr: 2 12416941-12518612):

Out of 20 genes in the original, 0 have close matches, 14 have partial matches, and 6 have no matches. Oddly, almost all of the partially matched genes in the original match the same single gene in the MAKER annotations. Only 10 genes are annotated in the MAKER annotations.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT4G36180.1 (chr: 4 17070209-17173698) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|4-exonerate_est2genome-gene-171.7-mRNA-1 (chr: 4 17070209-17173812):

Out of 23 genes in the original, 8 appear to have close matches, 1 is mostly close but is missing a few small parts of the gene, 5 look close but are matching with multiple genes, 2 appear to have partial matches, and 7 appear to have no matches. Only 14 genes are annotated in the MAKER-annotated genome.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT4G06676.1 (chr: 4 3848998-3951589) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|4-exonerate_est2genome-gene-39.0-mRNA-1 (chr: 4 3848998-3951589):

Out of 2 genes in the original, 1 appears to be relatively close, and 1 has no corresponding annotations. Only 1 gene is present in the MAKER-annotations, and the original has 20 pseudogenes that are not annotated in the MAKER-annotated genome.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT4G10730.1 (chr: 4 6559793-6664786) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|4-exonerate_est2genome-gene-66.2-mRNA-1 (chr: 4 6559793-6664786):

Out of 25 genes in the original, 6 appear to be close matches, 1 has a partial match, and 18 have no syntenic matches. Only 6 genes are annotated in the MAKER-annotated genome, but all 6 have close matches.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT3G22270.1 (chr: 3 7824480-7927857) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|3-exonerate_est2genome-gene-78.4-mRNA-1 (chr: 3 7824480-7927857):

Out of 24 genes in the original, 9 appear to be close matches, 6 appear to have partial matches, 4 look close but are matching up with multiple genes, and 5 have no matches. 17 genes are annotated in the MAKER-annotated genome.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT2G18470.1 (chr: 2 7955285-8057767) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|2-exonerate_est2genome-gene-80.8-mRNA-1 (chr: 2 7955285-8057767):

Out of 24 genes in the original, 12 appear to be close matches, 1 appears to be a partial match, and 11 have no matches. Only 13 genes are annotated in the MAKER-annotated genome.

For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT1G62340.1 (chr: 1 23001123-23105656) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|1-exonerate_est2genome-gene-230.10-mRNA-1 (chr: 1 23001123-23105656):

Out of 21 genes in the original, 6 appear to have close matches, 6 appear to have partial matches that are almost close matches, 1 appears to have a partial match that is not a close match, and 8 appear to have no matching annotations. There are only 12 genes in the MAKER-annotated genome.
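
To make the per-region tallies easier to compare, the close-match counts can be turned into percentages. The sketch below is illustrative only: the helper name summarize_matches is hypothetical, and the example input uses just the four regions above whose counts are reported plainly as "close matches" (labelled by the TAIR gene that names each region). Deciding how to count "relatively close" or partial matches is left to the reader.

```shell
# For each input line "label genes_in_original close_matches", print the
# share of original genes with a close match, then an overall line.
summarize_matches() {
    awk '
        {
            printf "%s\t%d/%d\t%.1f%%\n", $1, $3, $2, 100 * $3 / $2
            total += $2
            close_n += $3
        }
        END {
            if (total)
                printf "overall\t%d/%d\t%.1f%%\n", close_n, total, 100 * close_n / total
        }
    '
}

# Counts copied from the region summaries above.
summarize_matches <<'EOF'
AT1G26850 29 10
AT5G37060 15 9
AT4G10730 25 6
AT2G18470 24 12
EOF
```

Adding a line for another region only requires appending its label and two counts to the input.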

<UNDER CONSTRUCTION>