MAKER Test

MAKER is a genome annotation pipeline[1]. It lets a researcher or group of researchers take a genome and some amount of evidence (for example, an EST file and a protein file, both in FASTA format, plus a repeat file and potentially more) and create structural annotations for that genome. It can also train HMM files to provide better annotations for a genome with little evidence, although this takes many runs. This page documents the work being done to add MAKER to CoGe.

How to Download and Install MAKER

MAKER may be downloaded from the Yandell lab, here: http://www.yandell-lab.org/software/maker.html. The full installation instructions may be found here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial. The instructions here will just serve as a brief overview for getting MAKER running on the command line in UNIX.

1. Register and download MAKER from the Yandell lab MAKER software page.

2. Unpack MAKER in whichever folder it will be run from.

3. Download and install prerequisites if they are not installed. The minimum prerequisites are:

    a. BioPerl and various other Perl modules, listed below (see the MAKER documentation for a complete list[2]).
         i. BioPerl
         ii. DBI
         iii. Error
         iv. Error::Simple
         v. File::NFSLock
         vi. File::Which
         vii. Inline
         viii. Perl::Unsafe::Signals
         ix. Proc::Signal
         x. URI::Escape
         xi. Bit::Vector
         xii. Inline::C
         xiii. PerlIO::gzip
         xiv. forks
         xv. forks::shared
    b. SNAP
    c. Exonerate
    d. RepeatMasker
    e. NCBI BLAST

4. Add MAKER and its prerequisites to $PATH. For example, the paths might look something like:

    a. MAKER: export PATH="/home/user/maker/bin:$PATH"
    b. RepeatMasker: export PATH="/home/user/RepeatMasker:$PATH"
    c. Exonerate: export PATH="/home/user/exonerate-2.2.0-x86_64/bin:$PATH"
    d. SNAP: export PATH="/home/user/snap:$PATH"
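
Before running MAKER, it can help to confirm that the prerequisites from steps 3 and 4 are actually visible to the shell. The following is a minimal sketch and not part of the MAKER distribution. The helper names check_perl_modules and check_tools are made up for this page, only a subset of the Perl modules is checked, and the executable names (maker, RepeatMasker, exonerate, snap, blastn) are assumptions about a typical installation.

```shell
# Report whether each Perl module given as an argument can be loaded.
# (Hypothetical helper; module names come from the list in step 3.)
check_perl_modules() {
    status=0
    for module in "$@"; do
        if perl -M"$module" -e 1 >/dev/null 2>&1; then
            echo "found:   $module"
        else
            echo "missing: $module"
            status=1
        fi
    done
    return "$status"
}

# Report whether each executable given as an argument is on $PATH.
# (Hypothetical helper; executable names are assumptions.)
check_tools() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

check_perl_modules DBI Error File::Which Inline URI::Escape Bit::Vector PerlIO::gzip forks ||
    echo "Some Perl modules still need to be installed."
check_tools maker RepeatMasker exonerate snap blastn ||
    echo "Some programs are not on PATH yet."
```

Both helpers return a non-zero status when anything is missing, so they can also be used as a guard at the top of a longer script.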

Running MAKER from the UNIX command line

1. Generate the MAKER control files by typing "maker -CTL", then fill in the options file ("maker_opts.ctl").

    a. To set the genome sequence, enter the path to the genome FASTA file after "genome=". For example, "genome=dpp_contig.fasta".
       Leave a space between this and the commented description (which starts with a "#" symbol).
    b. Set the EST or mRNA data by typing the path to the desired EST or mRNA FASTA file after "est=".
    c. To have MAKER generate structural annotations directly from EST or mRNA data, change "est2genome=0" to "est2genome=1".
    d. Change any other desired settings. For more details on these settings and how to set them, see the full MAKER tutorial[3].

2. Open the MAKER BLAST options file ("maker_bopts.ctl") and ensure that the correct BLAST search type is selected.

    a. For example, to use NCBI BLAST+, set "blast_type=ncbi+".

3. Edit the MAKER options file ("maker_opts.ctl") to the desired settings.

4. Run MAKER by typing "maker" on the command line.

    a. To prevent MAKER from eating up too many system resources, use the "nice" command.
    b. To avoid being spammed with constant messages, redirect both stdout and stderr to a file with "&> somefile.txt". To run MAKER in the background as well, add "&" to the end of the command.
    c. Putting both "a" and "b" together, the command might look something like "nice maker &> log.txt &".

5. MAKER may take a day or more to run, depending on the size of the genome it is annotating and the evidence it is given. Once finished, it should place all files into a "genomename.maker.output" folder, with the data split into separate folders by contig. Each of these folders has several sub-folders, one of which should contain a "chromosomename.gff" file. This file holds the structural annotations and repeats, and always ends with the contig sequence.
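
The steps above can be sketched as a small shell session. This is an illustration under stated assumptions rather than an official MAKER workflow. The helper names (set_ctl_option, run_in_background, list_gff_files) and the EST file name my_est.fasta are hypothetical, and the output folder name dpp_contig.maker.output assumes the "genomename.maker.output" pattern described in step 5. The redirection "> file 2>&1" is used as the portable equivalent of "&>".

```shell
# Set KEY=VALUE in a MAKER control file, keeping the trailing "#" comment.
# Assumes VALUE contains no "|" or "&" characters (both are special to sed here).
set_ctl_option() {
    ctl_file=$1
    key=$2
    value=$3
    sed "s|^${key}=[^#]*|${key}=${value} |" "$ctl_file" > "${ctl_file}.tmp" &&
        mv "${ctl_file}.tmp" "$ctl_file"
}

# Run a command at reduced priority in the background, with stdout and
# stderr both written to LOG_FILE.
run_in_background() {
    log_file=$1
    shift
    nice "$@" > "$log_file" 2>&1 &
}

# List every GFF file below a MAKER output folder.
list_gff_files() {
    find "$1" -type f -name '*.gff' | sort
}

# Example session; it only runs where MAKER and its control files are present.
if command -v maker >/dev/null 2>&1 && [ -f maker_opts.ctl ]; then
    set_ctl_option maker_opts.ctl genome dpp_contig.fasta
    set_ctl_option maker_opts.ctl est my_est.fasta      # hypothetical file name
    set_ctl_option maker_opts.ctl est2genome 1
    run_in_background log.txt maker
    wait                                                # block until MAKER exits
    list_gff_files dpp_contig.maker.output              # assumed folder name
fi
```

While MAKER is running, "tail -f log.txt" in another terminal shows the messages that would otherwise have filled the screen.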

Testing MAKER annotation accuracy with minimal data

The point of this exercise is to see how little data can be used with MAKER and still get reasonably accurate annotations.

The Arabidopsis thaliana genome was annotated using MAKER with the Arabidopsis lyrata mRNA data as evidence. Both the genome and the mRNA annotations were downloaded from GenBank. The A. thaliana genome with the annotations from MAKER using the A. lyrata mRNA evidence was uploaded to CoGe (genome id 25440) and compared to the existing A. thaliana annotations (genome id 16911). The analysis can be recreated here: https://genomevolution.org/r/f6kd.

A fair amount of synteny exists between the two genomes, although noise is present. The actual gene regions are currently being compared to assess accuracy directly. For this comparison, the SynMap analysis linked above is used, and BLAST is then set to "BlastN: Small Regions" in GEvo. Also, under "Sequence Options," "Mask Sequence" is set to "Non-CDS" for both sequences.

Table explanation:

    Sample Number: Numerically tracks each trial.
    Coords id16911: The coordinates in the original genome.
    Coords id39871: The coordinates in the MAKER-annotated genome.
    Total Genes id16911: The total number of genes annotated in the original genome. Does not include genes only partially visible in the genome-viewer.
    Total Genes id39871: The total number of genes annotated in the MAKER-annotated genome. Does not include genes only partially visible in the genome-viewer.
    Close Match: The CDS regions of the gene look the same by crude visual analysis. Numbers in parentheses represent the number of matches where a small piece of the gene matched to a different gene.
    Similar Match: A few small differences in the CDS regions can be seen. Numbers in parentheses represent the number of matches where a small piece of the gene matched to a different gene.
    Partial Match: Only a relatively small piece of the gene matches. Numbers in parentheses represent the number of matches where a small piece of the gene matched to a different gene.
    No Match: The original gene is not represented in the new annotations.
    Pseudogene Misannotated: A pseudogene has been annotated as a functional gene in the MAKER annotations.
    New: A gene in the MAKER annotations is not found in the original genome's annotations.
    Multiple Match (Original:MAKER): Genes that match multiple regions. Each entry gives the number of such cases, followed by the ratio of matching places in the original genome annotations to matching places in the MAKER-annotated genome. For example, "1 2:1" means that one such multiple match was found, in which two places in the original annotations match one place in the MAKER annotations.
    Split (One gene to two+): Matches where only one gene in the original is present and multiple genes in the MAKER-annotated genome are found. This is different from a multiple match because each region is only matching one place.
    Joined (Two+ genes to one): Matches where multiple genes are present in the original and only one gene is present in the MAKER-annotated genome. This is different from a multiple match because each region is only matching one place.
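
Cells such as "15(5)" combine a match count with a parenthesized sub-count, and the "Column Totals" row sums both parts separately. As a rough sketch of that bookkeeping (the helper name sum_cells is made up for this page, and this is not necessarily how the totals were originally computed):

```shell
# Sum cells of the form "N" or "N(M)", adding the main counts and the
# parenthesized counts separately.
sum_cells() {
    printf '%s\n' "$@" | awk '
        {
            main = $0; extra = 0
            if (match($0, /\([0-9]+\)/)) {
                extra = substr($0, RSTART + 1, RLENGTH - 2)   # digits inside the parentheses
                main  = substr($0, 1, RSTART - 1)             # digits before the parentheses
            }
            total += main; total_extra += extra
        }
        END {
            if (total_extra > 0) printf "%d(%d)\n", total, total_extra
            else                 printf "%d\n", total
        }'
}

# Close Match cells of samples 2, 3 and 4
sum_cells "15(5)" 12 "16(9)"   # prints 43(14)
```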

Sample Number | Coords id16911 | Coords id39871 | Total Genes id16911 | Total Genes id39871 | Close Match | Similar Match | Partial Match | No Match | Pseudogene Misannotated | New | Multiple Match (Original:MAKER) | Split (One gene to two+) | Joined (Two+ genes to one)
1 | 1 544697-648473 | 1 544697-648473 | 30 | 16 | 10 | 4 | 1 | 12 | 0 | 0 | 0 | 0 | 0
2 | 1 71582-180099 | 1 71582-180099 | 36 | 21 | 15(5) | 2 | 1 | 11 | 0 | 0 | 1 2:1 | 0 | 1
3 | 1 1202212-1307510 | 1 1202212-1307510 | 26 | 17 | 12 | 3 | 2 | 7 | 0 | 0 | 1 2:1 | 0 | 0
4 | 1 1455642-1559002 | 1 1455642-1559002 | 29 | 16 | 16(9) | 0 | 5 | 8 | 0 | 0 | 1 2:1 | 0 | 1
5 | 1 1928762-2039295 | 1 1928762-2039046 | 33 | 21 | 19(2) | 1 | 0 | 13 | 1 | 0 | 3 2:1 | 0 | 0
6 | 1 2540944-2656892 | 1 2542248-2656892 | 31 | 19 | 15(6) | 1 | 6 | 9 | 0 | 0 | 4 3:1, 2 2:1 | 0 | 1
7 | 1 2879268-2981804 | 1 2879268-2981656 | 30 | 18 | 14(2) | 3 | 8 | 9 | 0 | 0 | 0 | 0 | 1
8 | 1 3382313-3487979 | 1 3382313-3487979 | 27 | 18 | 14(2) | 1 | 2 | 9 | 0 | 0 | 2 2:1 | 0 | 0
9 | 1 3901597-4006840 | 1 3903894-4006840 | 32 | 15 | 10(1) | 3 | 2 | 14 | 0 | 0 | 0 | 0 | 1
10 | 1 4197703-4300444 | 1 4197703-4300444 | 30 | 15 | 15(3) | 0 | 2 | 13 | 0 | 0 | 1 1:2 | 0 | 0
11 | 1 4958502-5064486 | 1 4958502-5064486 | 27 | 17 | 15(3) | 2 | 2 | 7 | 0 | 0 | 1 2:1 | 0 | 0
12 | 1 5380446-5485921 | 1 5380446-5485921 | 35 | 17 | 13(1) | 2(1) | 4 | 15 | 0 | 0 | 2 1:2 | 0 | 0
13 | 1 7384303-7486702 | 1 7379980-7486702 | 28 | 13 | 6 | 0 | 2 | 8 | 0 | 0 | 4 1:3, 5 1:2, 1 2:1 | 0 | 1
14 | 1 8513858-8619927 | 1 8513858-8618644 | 30 | 15 | 9(1) | 5(2) | 2 | 12 | 0 | 0 | 0 | 0 | 1
15 | 1 9389859-9495818 | 1 9384718-9495818 | 31 | 14 | 10(2) | 1(1) | 5 | 11 | 0 | 0 | 4 1:2 | 0 | 2
16 | 1 9389859-9495818 | 1 9384718-9495818 | 26 | 15 | 8 | 5 | 2 | 8 | 0 | 0 | 3 1:2 | 0 | 0
17 | 1 12158853-12263571 | 1 12158853-12263571 | 22 | 5 | 4 | 0 | 1 | 17 | 0 | 0 | 0 | 0 | 0
18 | 1 12859293-12963916 | 1 12859293-12963916 | 18 | 7 | 3(1) | 3(3) | 0 | 10 | 0 | 0 | 1 2:1, 2 1:2 | 0 | 0
19 | 1 14108617-14211652 | 1 14108617-14211652 | 5 | 2 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0
20 | 1 14108617-14211652 | 1 14108617-14211652 | 5 | 2 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0
21 | 1 15553892-15657802 | 1 15553892-15657802 | 4 | 2 | 1 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0
22 | 1 17682582-17808194 | 1 17682582-17805715 | 29 | 19 | 11(1) | 6(1) | 2 | 8 | 0 | 0 | 2 1:2 | 0 | 1
23 | 1 20236917-20340245 | 1 20236944-20340245 | 28 | 20 | 12(1) | 5(2) | 4 | 7 | 0 | 0 | 0 | 0 | 0
24 | 1 21984661-22089844 | 1 21984661-22089844 | 24 | 13 | 10(1) | 1(1) | 1 | 9 | 0 | 0 | 1 2:1, 2 1:2 | 0 | 1
25 | 1 26732839-26838712 | 1 26732839-26838712 | 33 | 16 | 15(2) | 1 | 0 | 17 | 0 | 0 | 0 | 0 | 0
26 | 5 185117-290911 | 5 185117-290911 | 30 | 19 | 12(2) | 5(1) | 4(1) | 8 | 0 | 0 | 0 | 0 | 0
27 | 5 2484720-2590086 | 5 2484720-2590086 | 29 | 18 | 14(6) | 2 | 2(2) | 6 | 0 | 0 | 2 3:1 | 0 | 0
28 | 5 5240999-5347779 | 5 5240999-5347779 | 30 | 20 | 18(4) | 2 | 1 | 8 | 0 | 0 | 1 2:1 | 0 | 0
29 | 5 7948309-8052594 | 5 7948309-8052594 | 29 | 16 | 12(4) | 2(1) | 2 | 6 | 0 | 0 | 0 | 0 | 2
30 | 5 10746648-10848147 | 5 10746648-10848147 | 12 | 1 | 1 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0
31 | 5 13424499-13532238 | 5 13434606-13537546 | 10 | 4 | 4(1) | 0 | 0 | 5 | 0 | 0 | 1 2:1 | 0 | 0
32 | 5 16020402-16123101 | 5 16020402-16123101 | 29 | 9 | 6(3) | 1(1) | 4 | 17 | 0 | 1 | 1 2:1, 2 1:2 | 0 | 0
33 | 5 18741802-18845407 | 5 18741802-18845407 | 25 | 17 | 12(2) | 6(3) | 1 | 3 | 0 | 0 | 1 2:1, 2 1:2 | 0 | 0
34 | 5 21699561-21801099 | 5 21384155-21488362 | 26 | 15 | 0 | 0 | 3(1) | 23 | 0 | 0 | 1 1:2 | 0 | 0
35 | 5 24375824-24480269 | 5 24375824-24480269 | 19 | 13 | 8(5) | 5(1) | 1 | 5 | 0 | 0 | 1 1:2 | 0 | 0
36 | 5 26726994-26835104 | 5 26726994-26835104 | 34 | 17 | 14(5) | 3 | 2 | 12 | 0 | 0 | 1 2:1 | 0 | 0
37 | 5 1395509-1499568 | 5 1395509-1499568 | 28 | 19 | 17(2) | 2 | 1(1) | 8 | 0 | 0 | 0 | 0 | 0
38 | 5 3963813-4071018 | 5 3963813-4070909 | 29 | 21 | 20(2) | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 0
39 | 5 6645731-6751247 | 5 6645731-6751247 | 35 | 21 | 19(4) | 2 | 2 | 12 | 0 | 0 | 0 | 0 | 0
40 | 5 9395950-9500584 | 5 9395950-9500584 | 28 | 16 | 12 | 3 | 1 | 12 | 0 | 0 | 0 | 0 | 0
41 | 5 12035328-12138569 | 5 12035328-12138569 | 4 | 4 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
42 | 5 14198202-14303272 | 5 14198202-14303110 | 17 | 8 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1
43 | 5 17439327-17544830 | 5 17439468-17544830 | 28 | 14 | 8(1) | 5(1) | 3(1) | 11 | 0 | 0 | 1 2:1 | 0 | 0
44 | 5 20126385-20238307 | 5 20126385-20238307 | 21 | 14 | 11(7) | 2(1) | 1 | 5 | 0 | 0 | 1 1:2, 4 1:4 | 0 | 1
45 | 5 22960801-23065559 | 5 22960764-23065559 | 26 | 14 | 10(5) | 4(2) | 2 | 8 | 0 | 0 | 2 2:1 | 0 | 0
46 | 5 25587492-25693902 | 5 25587492-25693902 | 30 | 23 | 19(7) | 3 | 3(1) | 5 | 0 | 0 | 7 3:1 | 0 | 0
47 | 3 1096943-1203341 | 3 1096943-1203341 | 32 | 19 | 17(5) | 0 | 2 | 8 | 0 | 0 | 3 1:2, 1 2:1 | 0 | 0
48 | 3 2215142-2329383 | 3 2215142-2329383 | 29 | 21 | 15(4) | 3(2) | 3(1) | 7 | 0 | 0 | 3 1:2 | 0 | 0
49 | 3 3432575-3541667 | 3 3432575-3541908 | 36 | 21 | 15(4) | 6(2) | 4 | 11 | 0 | 0 | 0 | 0 | 0
50 | 3 4704624-4811185 | 3 4704624-4811185 | 25 | 15 | 15(5) | 0 | 3 | 7 | 0 | 0 | 0 | 0 | 0
51 | 3 5886108-5996205 | 3 5886108-5996205 | 30 | 15 | 11 | 4 | 0 | 15 | 0 | 0 | 0 | 0 | 0
52 | 3 7009098-7112660 | 3 7009100-7112660 | 31 | 16 | 12(1) | 3(1) | 1 | 8 | 0 | 0 | 1 6:1, 1 2:1 | 0 | 0
53 | 3 8153548-8257243 | 3 8153548-8257243 | 18 | 6 | 4(3) | 2 | 3 | 9 | 0 | 0 | 0 | 0 | 0
54 | 3 9258942-9363353 | 3 9258942-9363353 | 25 | 12 | 9(2) | 3 | 2 | 10 | 0 | 0 | 0 | 0 | 0
55 | 3 10561071-10666301 | 3 10543921-10648775 | 16 | 5 | 1 | 4 | 0 | 9 | 0 | 0 | 5 1:3 | 0 | 0
56 | 3 11760867-11863509 | 3 11760867-11863509 | 3 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0
57 | 3 12982458-13085348 | 3 12983120-13085348 | 7 | 1 | 0 | 1 | 0 | 6 | 0 | 0 | 0 | 0 | 0
58 | 3 14035354-14143449 | 3 14035354-14143449 | 5 | 3 | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0
59 | 3 15184235-15295034 | 3 15184235-15290697 | 12 | 6 | 5(2) | 1 | 0 | 6 | 0 | 0 | 0 | 0 | 0
60 | 3 16347038-16449725 | 3 16347038-16449725 | 20 | 13 | 9(6) | 2(1) | 0 | 8 | 0 | 0 | 2 1:2, 3 1:3 | 1 | 0
61 | 3 17415629-17517888 | 3 17415629-17517888 | 23 | 13 | 11(2) | 2(1) | 0 | 10 | 0 | 0 | 2 1:2 | 0 | 0
62 | 3 18636527-18750533 | 3 18636527-18750533 | 26 | 12 | 8(3) | 3(1) | 1 | 11 | 0 | 0 | 1 4:1, 1 1:2, 3 1:3 | 0 | 0
63 | 3 19855826-19960027 | 3 19855826-19960027 | 36 | 25 | 22(2) | 3(1) | 1 | 10 | 0 | 0 | 0 | 0 | 0
64 | 3 21065078-21171051 | 3 21065078-21171051 | 32 | 19 | 15(3) | 4(2) | 6 | 6 | 0 | 0 | 2 2:1, 1 3:1 | 0 | 0
65 | 3 22173829-22279195 | 3 22173829-22279195 | 23 | 14 | 12(4) | 1 | 0 | 10 | 1 | 0 | 1 3:1 | 0 | 0
66 | 3 23140428-23245727 | 3 23140428-23245727 | 31 | 15 | 11(2) | 3 | 1 | 13 | 0 | 0 | 2 1:2 | 0 | 1
67 | 2 902313-1009004 | 2 902313-1007874 | 23 | 9 | 7 | 1 | 2 | 12 | 0 | 0 | 1 2:1 | 0 | 0
68 | 2 1933901-2035341 | 2 1925574-2026339 | 17 | 7 | 5(4) | 1 | 0 | 6 | 0 | 0 | 1 2:1, 5 1:3 | 0 | 0
69 | 2 2535215-2637601 | 2 2535215-2637601 | 8 | 2 | 2 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0
70 | 2 2997623-3107099 | 2 2997623-3107099 | 9 | 1 | 1 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0
71 | 2 4569145-4671448 | 2 4569145-4671448 | 7 | 1 | 1 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0
72 | 2 5064881-5168486 | 2 5064881-5168486 | 4 | 3 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
73 | 2 5645124-5756134 | 2 5645124-5756134 | 18 | 6 | 6(1) | 0 | 0 | 12 | 0 | 0 | 2 1:2 | 0 | 0
74 | 2 6724150-6827899 | 2 6724150-6827068 | 18 | 11 | 6(3) | 2 | 0 | 5 | 0 | 0 | 2 1:2, 3 1:3 | 1 | 1
75 | 2 7734455-7852230 | 2 7734455-7852230 | 28 | 18 | 14(1) | 3 | 2 | 9 | 0 | 0 | 0 | 0 | 0
76 | 2 8809211-8914699 | 2 8809211-8914699 | 37 | 18 | 16(2) | 2(1) | 2 | 16 | 0 | 0 | 1 2:1 | 0 | 0
77 | 2 9716127-9819766 | 2 9716127-9819766 | 28 | 13 | 10(1) | 1(1) | 1 | 10 | 0 | 0 | 7 1:7 | 0 | 2
78 | 2 10817070-10922077 | 2 10817070-10921395 | 26 | 13 | 8(2) | 2 | 1 | 12 | 0 | 0 | 0 | 0 | 0
79 | 2 11690670-11794867 | 2 11690670-11794867 | 33 | 17 | 15(3) | 1 | 1 | 15 | 0 | 0 | 1 2:1, 2 1:2 | 0 | 0
80 | 2 12802632-12907369 | 2 12802632-12907608 | 25 | 16 | 14(1) | 1 | 1 | 8 | 0 | 0 | 1 2:1 | 0 | 0
81 | 2 13702665-13806233 | 2 13702665-13806233 | 26 | 18 | 14(5) | 2 | 1 | 9 | 0 | 0 | 2 1:2 | 0 | 0
82 | 2 14604611-14707443 | 2 14604644-14707443 | 19 | 13 | 9(4) | 2 | 3 | 4 | 0 | 0 | 0 | 0 | 0
83 | 2 15650550-15755165 | 2 15650550-15755165 | 25 | 12 | 10(2) | 2 | 1 | 11 | 0 | 0 | 1 2:1 | 0 | 0
84 | 2 16665089-16773406 | 2 16665089-16773406 | 26 | 15 | 13(3) | 0 | 3 | 9 | 0 | 0 | 0 | 0 | 0
85 | 2 17684541-17788679 | 2 17684541-17788679 | 24 | 15 | 14(3) | 1 | 3 | 5 | 0 | 0 | 1 2:1 | 0 | 0
86 | 2 18707881-18822229 | 2 18707881-18822229 | 32 | 18 | 15(6) | 3(1) | 5 | 8 | 0 | 0 | 4 1:3 | 0 | 0
87 | 4 751713-858018 | 4 751713-858018 | 31 | 16 | 10(1) | 3 | 0 | 12 | 0 | 0 | 1 2:1 | 0 | 2
88 | 4 1722114-1824380 | 4 1722169-1824380 | 3 | 2 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0
89 | 4 17593684-17695729 | 4 17593684-17695729 | 23 | 13 | 10 | 3(1) | 0 | 9 | 0 | 0 | 1 2:1 | 0 | 0
90 | 4 16618075-16721974 | 4 16618075-16721974 | 25 | 15 | 11(3) | 4(2) | 2 | 8 | 0 | 0 | 1 2:1 | 0 | 0
91 | 4 2696288-2802663 | 4 2696288-2802663 | 21 | 11 | 9(3) | 2 | 0 | 6 | 0 | 0 | 1 2:1, 5 1:2 | 0 | 0
92 | 4 3714499-3816439 | 4 3714499-3816439 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
93 | 4 4837112-4938939 | 4 4837112-4938939 | 4 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
94 | 4 6473657-6578058 | 4 6473696-6578058 | 22 | 11 | 8(5) | 2 | 1 | 10 | 0 | 0 | 9 1:2 | 0 | 1
95 | 4 8312586-8416525 | 4 8312586-8416525 | 23 | 12 | 7(3) | 3 | 2(1) | 11 | 0 | 0 | 0 | 0 | 0
96 | 4 10189947-10297372 | 4 10189947-10297372 | 30 | 15 | 13(3) | 1 | 4 | 11 | 0 | 0 | 1 3:1 | 0 | 0
97 | 4 11983703-12093572 | 4 11983703-12093572 | 26 | 16 | 13(1) | 3 | 2 | 8 | 0 | 0 | 1 2:1 | 0 | 0
98 | 4 13898993-14007840 | 4 13898993-14007840 | 38 | 23 | 22(2) | 0 | 2 | 14 | 0 | 0 | 0 | 0 | 0
99 | 4 14876974-14979681 | 4 14876974-14979681 | 32 | 22 | 21(1) | 1(1) | 2 | 8 | 0 | 0 | 2 1:2, 3 1:3 | 0 | 0
100 | 4 15741039-15845643 | 4 15741039-15845643 | 22 | 19 | 16(5) | 3(2) | 0 | 3 | 0 | 0 | 2 1:2 | 0 | 0
Column Totals | N/A | N/A | 2362 | 1312 | 1006(219) | 198(42) | 156(9) | 852 | 3 | 1 | N/A | 2 | 22

Summary Percentages

1312/2362 = 56% of genes annotated

1006/1312 = 77% had same CDS

1204/1312 = 92% had similar or better CDS

852/2362 = 36% of genes were not annotated
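
The percentages above can be reproduced from the column totals with a one-line awk call. This is only a convenience sketch; the helper name percent is made up for this page.

```shell
# Print NUMERATOR/DENOMINATOR as a whole-number percentage.
percent() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.0f%%\n", 100 * n / d }'
}

percent 1312 2362   # genes annotated                                -> 56%
percent 1006 1312   # same CDS                                       -> 77%
percent 1204 1312   # similar or better CDS (1006 close + 198 similar) -> 92%
percent 852 2362    # genes not annotated                            -> 36%
```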


Note: 56% + 36% = 92%, so where is the missing 8%? The answer is that many genes had multiple syntenic matches, and these are accounted for in the "partial matches" column and the "joined" and "split" columns. If these were to be separated out, then the total missed genes would be (2362 * 0.08) + 852 = 1041.
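
The arithmetic in the note can be checked directly from the column totals. The sketch below uses variable names chosen for this page. It reproduces the figure of 1041 from the rounded 8%, and also shows that the unrounded remainder is 198 genes (about 8.4%), which would give 1050 instead.

```shell
total=2362          # Total Genes id16911
annotated=1312      # Total Genes id39871
not_annotated=852   # No Match

# Genes covered by neither the 56% nor the 36% figure
remainder=$((total - annotated - not_annotated))

# Estimate from the note, using the rounded 8%
estimate=$(awk -v t="$total" -v m="$not_annotated" 'BEGIN { printf "%.0f", t * 0.08 + m }')

# Same calculation with the unrounded remainder
exact=$((remainder + not_annotated))

echo "remainder: $remainder"   # remainder: 198
echo "estimate:  $estimate"    # estimate:  1041
echo "exact:     $exact"       # exact:     1050
```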