MAKER Test

From CoGepedia

MAKER is a genome annotation pipeline[1]. It lets a researcher or group of researchers take a genome and some amount of evidence (for example, an EST file and a protein file, both in FASTA format, plus a repeat file and potentially more) and create structural annotations for that genome. It can also train HMM files to provide better annotations for a genome with little evidence, although this takes many runs. This page documents the work being done to add MAKER to CoGe.

How to Download and Install MAKER

MAKER may be downloaded from the Yandell lab, here: http://www.yandell-lab.org/software/maker.html. The full installation instructions may be found here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial. The instructions on this page serve only as a brief overview for getting MAKER running on the UNIX command line.

1. Register and download MAKER from the Yandell lab MAKER software page.

2. Unpack MAKER in whichever folder it will be run from.

3. Download and install prerequisites if they are not installed. The minimum prerequisites are:

    a. BioPerl and various other Perl modules (see the MAKER documentation for a complete list[2]).
    b. SNAP
    c. Exonerate
    d. RepeatMasker
    e. NCBI BLAST

4. Add MAKER and its prerequisites to $PATH. For example, the paths might look something like:

    a. MAKER: export PATH="/home/user/maker/bin:$PATH"
    b. RepeatMasker: export PATH="/home/user/RepeatMasker:$PATH"
    c. Exonerate: export PATH="/home/user/exonerate-2.2.0-x86_64/bin:$PATH"
    d. SNAP: export PATH="/home/user/snap:$PATH"
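
Once the paths are exported, it can help to confirm that each tool is actually visible on $PATH before starting a long run. The following is a minimal sketch, not part of MAKER itself. The helper name "check_tools" is hypothetical, and the executable names passed to it (maker, RepeatMasker, exonerate, snap, blastn, makeblastdb) are assumptions based on typical installs, so adjust them to match the local setup.

```shell
#!/bin/sh
# Hypothetical helper: report which of the named executables can be
# found on $PATH. Returns 0 if all were found, 1 if any are missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names below are assumptions; change them if the local
# installation uses different names.
check_tools maker RepeatMasker exonerate snap blastn makeblastdb \
    || echo "Some prerequisites are not on \$PATH yet."
```

Any tool reported as MISSING needs its directory added to $PATH in the same way as the examples above.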


Running MAKER from the UNIX command line

1. Generate the MAKER control files by typing "maker -CTL". This creates "maker_opts.ctl", "maker_bopts.ctl", and "maker_exe.ctl" in the current directory. The settings below are made in "maker_opts.ctl".

    a. To set the genome sequence, enter the path to the genome FASTA file after "genome=", for example "genome=dpp_contig.fasta".
       Leave a space between the path and the commented description (which starts with a "#" symbol).
    b. To set the EST or mRNA evidence, enter the path to the desired EST or mRNA FASTA file after "est=".
    c. To have MAKER generate structural annotations directly from EST or mRNA data, change "est2genome=0" to "est2genome=1".
    d. Change any other desired settings. For more details on these settings and how to set them, see the full MAKER tutorial[3].

2. Open the MAKER boot options file ("maker_bopts.ctl") and ensure that the correct BLAST search type is selected.

    a. For example, to use NCBI-BLAST, set "blast_type=ncbi+".

3. Edit the MAKER options file ("maker_opts.ctl") to the desired settings.

4. Run MAKER by typing "maker" on the command line.

    a. To keep MAKER from monopolizing the CPU, lower its scheduling priority with the "nice" command.
    b. To avoid being flooded with constant messages, redirect both stdout and stderr to a file with "&> somefile.txt". To also run MAKER in the background, append "&" to the command.
    c. Putting both "a" and "b" together, the command might look something like "nice maker &> log.txt", or "nice maker &> log.txt &" to run it in the background.

5. MAKER may take a day or more to run, depending on the size of the genome it is annotating and the evidence it is given. Once finished, it should place all files into a "genomename.maker.output" folder, with the data split into separate folders by contig. Each of these folders has several sub-folders, one of which should contain a "chromosomename.gff" file. This file holds the structural annotations and repeats, and always ends with the contig sequence.
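
The steps above can be sketched as a pair of small shell helpers. This is only an illustration under stated assumptions, not an official MAKER workflow. The function names "set_ctl_option" and "gff_annotations_only" are hypothetical. The first assumes each control file line has the form "key=value #comment", as described in step 1. The second assumes the contig sequence at the end of the GFF file is introduced by a "##FASTA" line, which is the GFF3 convention for embedded sequence. The filename "est.fasta" in the usage comments is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: set KEY to VALUE in a MAKER control file,
# keeping the trailing "#comment" description on that line.
# usage: set_ctl_option FILE KEY VALUE
set_ctl_option() {
    local file="$1" key="$2" value="$3" tmp status
    tmp=$(mktemp) || return 1
    awk -v key="$key" -v value="$value" '
        index($0, key "=") == 1 {            # line starts with "key="
            comment = ""
            pos = index($0, "#")
            if (pos > 0) comment = substr($0, pos)
            print key "=" value (comment == "" ? "" : " " comment)
            next
        }
        { print }                            # all other lines unchanged
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Hypothetical helper: print a GFF file up to, but not including,
# the "##FASTA" line that introduces the embedded contig sequence.
# usage: gff_annotations_only FILE
gff_annotations_only() {
    awk '/^##FASTA/ { exit } { print }' "$1"
}

# Example session (commented out because it needs MAKER installed):
#   maker -CTL
#   set_ctl_option maker_opts.ctl genome dpp_contig.fasta
#   set_ctl_option maker_opts.ctl est est.fasta      # placeholder filename
#   set_ctl_option maker_opts.ctl est2genome 1
#   set_ctl_option maker_bopts.ctl blast_type ncbi+
#   nice maker > log.txt 2>&1 &    # portable spelling of "&> log.txt &"
#   gff_annotations_only path/to/chromosomename.gff > annotations_only.gff
```

Editing the control files by hand works just as well. A helper like this is mainly useful when the same settings need to be applied to several runs.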

Testing MAKER annotation accuracy with minimal data

The point of this exercise is to see how little data can be used with MAKER and still get reasonably accurate annotations.

The Arabidopsis thaliana genome was annotated using MAKER with the Arabidopsis lyrata mRNA data as evidence. Both the genome and the mRNA annotations were downloaded from GenBank. The A. thaliana genome with the annotations from MAKER using the A. lyrata mRNA evidence was uploaded to CoGe (genome id 25440) and compared to the existing A. thaliana annotations (genome id 16911). The analysis can be recreated here: https://genomevolution.org/r/f6kd.

A fair amount of synteny exists between the two genomes, although noise is present. The actual gene regions are currently being compared to assess accuracy directly. For this comparison, the SynMap analysis linked above is used, and BLAST is then set to "BlastN: Small Regions" in GEvo. Also, under "Sequence Options," "Mask Sequence" is set to "Non-CDS" for both sequences.

Table explanation:

    Coords id16911: The coordinates in the original genome.
    Coords id39871: The coordinates in the MAKER-annotated genome.
    Total Genes id16911: The total number of genes annotated in the original genome. Does not include genes only partially visible in the genome-viewer.
    Total Genes id39871: The total number of genes annotated in the MAKER-annotated genome. Does not include genes only partially visible in the genome-viewer.
    Close Match: The CDS regions of the gene look the same by crude visual analysis. Numbers in parentheses represent the number of matches where a small piece of the gene matched to a different gene.
    Similar Match: A few small differences in the CDS regions can be seen. Numbers in parentheses represent the number of matches where a small piece of the gene matched to a different gene.
    Partial Match: Only a relatively small piece of the gene matches. Numbers in parentheses represent the number of matches where a small piece of the gene matched to a different gene.
    No Match: The original gene is not represented in the new annotations.
    Pseudogene Misannotated: A pseudogene has been annotated as a functional gene in the MAKER annotations.
    New: A gene in the MAKER annotations is not found in the original genome's annotations.
    Multiple Match (Original:MAKER): Genes that match multiple regions. Each entry gives the number of such matches found, followed by a ratio: the number of matching places in the original genome annotations, then the number of matching places in the MAKER-annotated genome. For example, "1 2:1" means that one such multiple match was found, in which two places in the original annotations match one place in the MAKER annotations.
    Split (One gene to two+): Matches where only one gene in the original is present and multiple genes in the MAKER-annotated genome are found. This is different from a multiple match because each region is only matching one place.
    Joined (Two+ genes to one): Matches where multiple genes are present in the original and only one gene is present in the MAKER-annotated genome. This is different from a multiple match because each region is only matching one place.

| Sample Number | Coords id16911 | Coords id39871 | Total Genes id16911 | Total Genes id39871 | Close Match | Similar Match | Partial Match | No Match | Pseudogene Misannotated | New | Multiple Match (Original:MAKER) | Split (One gene to two+) | Joined (Two+ genes to one) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 544697-648473 | 1 544697-648473 | 30 | 16 | 10 | 4 | 1 | 12 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 71582-180099 | 1 71582-180099 | 36 | 21 | 15(5) | 2 | 1 | 11 | 0 | 0 | 1 2:1 | 0 | 1 |
| 3 | 1 1202212-1307510 | 1 1202212-1307510 | 26 | 17 | 12 | 3 | 2 | 7 | 0 | 0 | 1 2:1 | 0 | 0 |
| 4 | 1 1455642-1559002 | 1 1455642-1559002 | 29 | 16 | 16(9) | 0 | 5 | 8 | 0 | 0 | 1 2:1 | 0 | 1 |
| 5 | 1 1928762-2039295 | 1 1928762-2039046 | 33 | 21 | 19(2) | 1 | 0 | 13 | 1 | 0 | 3 2:1 | 0 | 0 |
| 6 | 1 2540944-2656892 | 1 2542248-2656892 | 31 | 19 | 15(6) | 1 | 6 | 9 | 0 | 0 | 4 3:1, 2 2:1 | 0 | 1 |
| 7 | 1 2879268-2981804 | 1 2879268-2981656 | 30 | 18 | 14(2) | 3 | 8 | 9 | 0 | 0 | 0 | 0 | 1 |
| 8 | 1 3382313-3487979 | 1 3382313-3487979 | 27 | 18 | 14(2) | 1 | 2 | 9 | 0 | 0 | 2 2:1 | 0 | 0 |
| 9 | 1 3901597-4006840 | 1 3903894-4006840 | 32 | 15 | 10(1) | 3 | 2 | 14 | 0 | 0 | 0 | 0 | 1 |
| 10 | 1 4197703-4300444 | 1 4197703-4300444 | 30 | 15 | 15(3) | 0 | 2 | 13 | 0 | 0 | 1 1:2 | 0 | 0 |
| 11 | 1 4958502-5064486 | 1 4958502-5064486 | 27 | 17 | 15(3) | 2 | 2 | 7 | 0 | 0 | 1 2:1 | 0 | 0 |
| 12 | 1 5380446-5485921 | 1 5380446-5485921 | 35 | 17 | 13(1) | 2(1) | 4 | 15 | 0 | 0 | 2 1:2 | 0 | 0 |
| 13 | 1 7384303-7486702 | 1 7379980-7486702 | 28 | 13 | 6 | 0 | 2 | 8 | 0 | 0 | 4 1:3, 5 1:2, 1 2:1 | 0 | 1 |
| 14 | 1 8513858-8619927 | 1 8513858-8618644 | 30 | 15 | 9(1) | 5(2) | 2 | 12 | 0 | 0 | 0 | 0 | 1 |
| 15 | 1 9389859-9495818 | 1 9384718-9495818 | 31 | 14 | 10(2) | 1(1) | 5 | 11 | 0 | 0 | 4 1:2 | 0 | 2 |
| 16 | 1 9389859-9495818 | 1 9384718-9495818 | 26 | 15 | 8 | 5 | 2 | 8 | 0 | 0 | 3 1:2 | 0 | 0 |
| 17 | 1 12158853-12263571 | 1 12158853-12263571 | 22 | 5 | 4 | 0 | 1 | 17 | 0 | 0 | 0 | 0 | 0 |
| 18 | 1 12859293-12963916 | 1 12859293-12963916 | 18 | 7 | 3(1) | 3(3) | 0 | 10 | 0 | 0 | 1 2:1, 2 1:2 | 0 | 0 |
| 19 | 1 14108617-14211652 | 1 14108617-14211652 | 5 | 2 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| 20 | 1 14108617-14211652 | 1 14108617-14211652 | 5 | 2 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| 21 | 1 15553892-15657802 | 1 15553892-15657802 | 4 | 2 | 1 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| 22 | 1 17682582-17808194 | 1 17682582-17805715 | 29 | 19 | 11(1) | 6(1) | 2 | 8 | 0 | 0 | 2 1:2 | 0 | 1 |
| 23 | 1 20236917-20340245 | 1 20236944-20340245 | 28 | 20 | 12(1) | 5(2) | 4 | 7 | 0 | 0 | 0 | 0 | 0 |
| 24 | 1 21984661-22089844 | 1 21984661-22089844 | 24 | 13 | 10(1) | 1(1) | 1 | 9 | 0 | 0 | 1 2:1, 2 1:2 | 0 | 1 |
| 25 | 1 26732839-26838712 | 1 26732839-26838712 | 33 | 16 | 15(2) | 1 | 0 | 17 | 0 | 0 | 0 | 0 | 0 |
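
One possible way to summarize a row of the table is the share of original genes that received a close match. The sketch below is only an illustration. Dividing "Close Match" by "Total Genes id16911" is an assumed summary, not a metric defined on this page, and the helper name "close_match_percent" is hypothetical. Any parenthesized count such as "(5)" is ignored, so only the leading number is used.

```shell
#!/bin/sh
# Hypothetical helper: percentage of original genes with a close match.
# usage: close_match_percent CLOSE_MATCH TOTAL_GENES_ORIGINAL
# CLOSE_MATCH may carry a parenthesized suffix such as "15(5)".
close_match_percent() {
    awk -v c="$1" -v t="$2" 'BEGIN {
        sub(/\(.*/, "", c)                 # drop "(n)" suffix if present
        if (t <= 0) { print "NA"; exit }   # avoid division by zero
        printf "%.1f\n", 100 * c / t
    }'
}

close_match_percent 10 30        # sample 1: prints 33.3
close_match_percent "15(5)" 36   # sample 2: prints 41.7
```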

<UNDER CONSTRUCTION>