MAKER Test
Revision as of 23:43, 9 March 2015
MAKER is a genome annotation pipeline[1]. It allows a researcher or group of researchers to take a genome and some amount of evidence (for example, an EST file and a protein file, both in FASTA format, a repeat file, and potentially more) and create structural annotations for that genome. It is capable of training HMM files in order to provide better annotations for a genome with little evidence, although this takes many runs. This page is an attempt to document the work being done to add MAKER into CoGe.
How to Download and Install MAKER
MAKER may be downloaded from the Yandell lab, here: http://www.yandell-lab.org/software/maker.html. The full installation instructions may be found here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial. The instructions here will just serve as a brief overview for getting MAKER running on the command line in UNIX.
1. Register and download MAKER from the Yandell lab MAKER software page.
2. Unpack MAKER in whichever folder it will be run from.
3. Download and install prerequisites if they are not installed. The minimum prerequisites are:
a. BioPerl and various other Perl modules (see the MAKER documentation for a complete list[2]).
b. SNAP
c. Exonerate
d. RepeatMasker
e. NCBI BLAST
4. Add MAKER and its prerequisites to $PATH. For example, the paths might look something like:
a. MAKER: export PATH="/home/user/maker/bin:$PATH"
b. RepeatMasker: export PATH="/home/user/RepeatMasker:$PATH"
c. Exonerate: export PATH="/home/user/exonerate-2.2.0-x86_64/bin:$PATH"
d. SNAP: export PATH="/home/user/snap:$PATH"
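Once the paths are exported, it can help to confirm that the shell actually finds each program before starting a long run. The following is a minimal sketch, not part of the MAKER documentation. The helper name check_tools is made up for this page, and the executable names (maker, RepeatMasker, exonerate, snap, blastn) are assumptions that should be adjusted to match the local install.

```shell
# Report which of the named executables cannot be found on $PATH.
# Prints one "missing: NAME" line per absent tool and returns the
# number of missing tools as the exit status (0 means all were found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Executable names are assumptions; edit the list for the local setup.
check_tools maker RepeatMasker exonerate snap blastn \
    || echo "some prerequisites are not on PATH"
```

If anything is reported as missing, re-check the corresponding export line and open a new shell (or re-source the profile file) before trying again.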
Running MAKER from the UNIX command line
1. Set up the MAKER control files by typing "maker -CTL".
a. To set the genome sequence, enter the path to the genome FASTA file after "genome=". So, this might look like "genome=dpp_contig.fasta". Leave a space between this and the commented description (which starts with a "#" symbol).
b. Set the EST or mRNA data by typing the path to the desired EST or mRNA FASTA file after "est=".
c. To have MAKER generate structural annotations directly from EST or mRNA data, change "est2genome=0" to "est2genome=1".
d. Change any other desired settings. For more details on these settings and how to set them, see the full MAKER tutorial[3].
2. Open the MAKER BLAST options file ("maker_bopts.ctl") and ensure that the correct BLAST search type is selected.
a. For example, to use NCBI-BLAST, set "blast_type=ncbi+".
3. Edit the MAKER options file ("maker_opts.ctl") to the desired settings.
4. Run MAKER by typing "maker" on the command line.
a. To keep MAKER from consuming too many system resources, lower its priority with the "nice" command.
b. To avoid being flooded with constant messages, redirect both stdout and stderr to a file with "&> somefile.txt". Appending "&" to the end of the command also runs it in the background.
c. Putting both "a" and "b" together, the command might look something like "nice maker &> log.txt" (or "nice maker &> log.txt &" to run it in the background).
5. MAKER may take a day or more to run, depending on the size of the genome being annotated and the evidence provided. Once finished, it should place all files into a "genomename.maker.output" folder, with the data split into separate folders by contig. Each of these folders will have several sub-folders, one of which should contain a "chromosomename.gff" file. This file holds the structural annotations and repeats, and always ends with the contig sequence.
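The control-file edits described in the steps above can also be scripted instead of made by hand. The sketch below is one possible approach and not something MAKER provides: set_ctl_option is a hypothetical helper written for this page, and it assumes each setting sits on its own line in the form key=value followed by a "#" comment. It will not handle values that contain "|" or "&".

```shell
# Set KEY=VALUE in a MAKER control file, keeping the trailing "#" comment.
# Usage: set_ctl_option FILE KEY VALUE
# Assumes VALUE contains no "|" or "&" characters (they are special to sed).
set_ctl_option() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Replace everything between "KEY=" and the first "#" with "VALUE ",
    # which leaves one space before the commented description.
    sed "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Hypothetical usage, run in the folder where "maker -CTL" wrote the files
# (est.fasta is a placeholder file name):
#   set_ctl_option maker_opts.ctl genome dpp_contig.fasta
#   set_ctl_option maker_opts.ctl est est.fasta
#   set_ctl_option maker_opts.ctl est2genome 1
#   set_ctl_option maker_bopts.ctl blast_type ncbi+
#   nice maker > log.txt 2>&1 &
```

The last commented line uses "> log.txt 2>&1", which is the portable spelling of "&> log.txt", and the trailing "&" is what actually puts the job in the background.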
Testing MAKER annotation accuracy with minimal data
The point of this exercise is to see how little data can be used with MAKER and still get reasonably accurate annotations.
The Arabidopsis thaliana genome was annotated using MAKER with the Arabidopsis lyrata mRNA data as evidence. Both the genome and the mRNA annotations were downloaded from GenBank. The A. thaliana genome with the annotations from MAKER using the A. lyrata mRNA evidence was uploaded to CoGe (genome id 25440) and compared to the existing A. thaliana annotations (genome id 16911). The analysis can be recreated here: https://genomevolution.org/r/f6kd.
A fair amount of synteny exists between the two genomes, although noise is present. The actual gene regions are currently being compared to assess accuracy directly. The SynMap listed above is used, and then BLAST is set to "BlastN: Small Regions" in GEvo. Also, under "Sequence Options," "Mask Sequence" is set to "Non-CDS" for both sequences.
1. For the regions: Arabidopsis thaliana Col-0 (thale cress) (TAIR v10.02, unmasked) AT1G02730.1 (chr: 1 544697-648473) and Arabidopsis thaliana (genbank vLJK-Rotation, unmasked) maker-gi|1-exonerate_est2genome-gene-6.6-mRNA-1 (chr: 1 544697-648473):
Out of 30 genes in the original, 4 appear to be identical, 5 appear nearly identical (no green CDS or gray full gene differences), 1 appears to be a close match (no noticeable CDS differences), 4 appear to be relatively close matches (CDS looks similar), 1 appears to be a partial match (CDS only matches a little), and 12 appear to have no match. 16 genes are annotated in the MAKER-annotated genome.
Coords id16911 | Coords id39871 | Total Genes id16911 | Close Match | Similar Match | Partial Match | No Match | Pseudogene Misannotated | New | Multiple (Original:MAKER) | Split (One gene to two+) | Joined (Two+ genes to one) | Total Genes id 39871 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 544697-648473 | 1 544697-648473 | 30 | 10 | 4 | 1 | 12 | 0 | 0 | 0 | 0 | 0 | 16 |
1 71582-180099 | 1 71582-180099 | 36 | 15(5) | 2 | 1 | 11 | 0 | 0 | 1 2:1 | 0 | 1 | 16 |
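The "Total Genes" columns above were counted by eye in GEvo. As a possible cross-check, the number of gene features in a coordinate window can also be counted directly from a GFF file. This is only a sketch under stated assumptions: count_genes is a hypothetical helper, the file is assumed to be tab-separated GFF3 with the feature type "gene" in column 3, and parsing stops at the "##FASTA" line because the MAKER output ends with the contig sequence.

```shell
# Count "gene" features on sequence SEQ that overlap START..END in a GFF3 file.
# Usage: count_genes FILE SEQ START END
count_genes() {
    awk -F '\t' -v seq="$2" -v s="$3" -v e="$4" '
        /^##FASTA/ { exit }    # stop before the appended sequence
        /^#/       { next }    # skip comments and directives
        $1 == seq && $3 == "gene" && $4 <= e && $5 >= s { n++ }
        END        { print n + 0 }
    ' "$1"
}

# Hypothetical usage; SEQID must match column 1 of the GFF file:
#   count_genes chromosomename.gff SEQID 544697 648473
```

A gene is counted if it overlaps the window at all, so genes that straddle a boundary are included. The result may therefore differ slightly from a manual count that only considers genes lying fully inside the region.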
<UNDER CONSTRUCTION>