GFF ingestion: Difference between revisions

From CoGepedia
 
(6 intermediate revisions by one other user not shown)
Line 1: Line 1:
CoGe translates many of the features from a [http://www.sequenceontology.org/resources/gff3.html standard GFF file] into various [[genomic features]] in CoGe's database.  For a basic protein coding gene, CoGe tracks three major genomic features:
[[File:Screen shot 2012-04-17 at 1.10.42 PM.png|thumb|right|400px|CoGe visualization of [[genomic feature]] from the rice genome]]
 
CoGe translates many of the features from a [https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md standard GFF file] into various [[genomic features]] in CoGe's database.  For a basic protein coding gene, CoGe tracks three major genomic features:
*[[gene]]:  the full extent of the transcribed unit including introns
*[[mRNA]]:  the spliced transcript
*[[CDS]]: the regions that code for protein.


[[File:Screen shot 2012-04-17 at 1.10.42 PM.png|thumb|left|400px|CoGe visualization of [[genomic feature]] from the rice genome]]
<div style="clear: both"></div>


From the GFF3 entry below, the gene and mRNA features are collapsed to a gene in CoGe, the exons are combined to make an mRNA in CoGe, and the CDSs are used as a CDS feature in CoGe.  The UTRs are skipped as being redundant with the exons.
Line 25: Line 27:
Chr1    MSU_osa1r7      three_prime_UTR 15360  15915  .      +      .      ID=LOC_Os01g01030.1:utr_2;Parent=LOC_Os01g01030.1
</pre>
CoGe requires columns 1, 3, 4, 5, 7, and 9 (seqid, type, start, end, strand, and attributes) to have valid values and ignores the other columns.  The attributes column must be formatted into key/value pairs delimited by semi-colons and have an ID or Name for each line.

Latest revision as of 21:09, 15 December 2016

CoGe visualization of genomic feature from the rice genome

CoGe translates many of the features from a standard GFF file into various genomic features in CoGe's database. For a basic protein-coding gene, CoGe tracks three major genomic features:

  • gene: the full extent of the transcribed unit including introns
  • mRNA: the spliced transcript
  • CDS: the regions that code for protein.

In the GFF3 entry below, the gene and mRNA lines are collapsed into a single gene feature in CoGe, the exons are combined to make an mRNA feature, and the CDS lines are used as a CDS feature. The UTR lines are skipped because they are redundant with the exons.

Example GFF3 entry for a protein-coding gene from the rice genome (v7):

Chr1    MSU_osa1r7      gene    12648   15915   .       +       .       ID=LOC_Os01g01030;Name=LOC_Os01g01030;Note=monocopper%20oxidase%2C%20putative%2C%20expressed
Chr1    MSU_osa1r7      mRNA    12648   15915   .       +       .       ID=LOC_Os01g01030.1;Name=LOC_Os01g01030.1;Parent=LOC_Os01g01030
Chr1    MSU_osa1r7      exon    12648   13813   .       +       .       ID=LOC_Os01g01030.1:exon_1;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      exon    13906   14271   .       +       .       ID=LOC_Os01g01030.1:exon_2;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      exon    14359   14437   .       +       .       ID=LOC_Os01g01030.1:exon_3;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      exon    14969   15171   .       +       .       ID=LOC_Os01g01030.1:exon_4;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      exon    15266   15915   .       +       .       ID=LOC_Os01g01030.1:exon_5;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      five_prime_UTR  12648   12773   .       +       .       ID=LOC_Os01g01030.1:utr_1;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      CDS     12774   13813   .       +       .       ID=LOC_Os01g01030.1:cds_1;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      CDS     13906   14271   .       +       .       ID=LOC_Os01g01030.1:cds_2;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      CDS     14359   14437   .       +       .       ID=LOC_Os01g01030.1:cds_3;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      CDS     14969   15171   .       +       .       ID=LOC_Os01g01030.1:cds_4;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      CDS     15266   15359   .       +       .       ID=LOC_Os01g01030.1:cds_5;Parent=LOC_Os01g01030.1
Chr1    MSU_osa1r7      three_prime_UTR 15360   15915   .       +       .       ID=LOC_Os01g01030.1:utr_2;Parent=LOC_Os01g01030.1
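
The collapsing described above can be sketched in Python. This is only an illustration of the mapping, not CoGe's actual loader: the function name `collapse_entry`, the `EXAMPLE` list (a shortened copy of the entry above), and the returned dictionary layout are assumptions made for this sketch, and it handles one gene with one transcript.

```python
def collapse_entry(gff_lines):
    """Group one gene's GFF3 lines into gene, mRNA, and CDS segments."""
    features = {"gene": [], "mRNA": [], "CDS": []}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        # The example on this page is whitespace-separated; real GFF3 uses tabs.
        cols = line.split(None, 8)
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        if ftype in ("gene", "mRNA"):
            features["gene"].append((start, end))   # collapsed into the gene
        elif ftype == "exon":
            features["mRNA"].append((start, end))   # exons make up the mRNA
        elif ftype == "CDS":
            features["CDS"].append((start, end))    # CDS segments kept as CDS
        # UTR lines (and other types) are skipped in this sketch.
    if features["gene"]:
        # Collapse gene and mRNA lines to one span covering both.
        lo = min(s for s, _ in features["gene"])
        hi = max(e for _, e in features["gene"])
        features["gene"] = [(lo, hi)]
    features["mRNA"].sort()
    features["CDS"].sort()
    return features


# Shortened copy of the rice entry: first two exons and CDS segments only.
EXAMPLE = [
    "Chr1 MSU_osa1r7 gene 12648 15915 . + . ID=LOC_Os01g01030;Name=LOC_Os01g01030",
    "Chr1 MSU_osa1r7 mRNA 12648 15915 . + . ID=LOC_Os01g01030.1;Parent=LOC_Os01g01030",
    "Chr1 MSU_osa1r7 exon 12648 13813 . + . ID=LOC_Os01g01030.1:exon_1;Parent=LOC_Os01g01030.1",
    "Chr1 MSU_osa1r7 exon 13906 14271 . + . ID=LOC_Os01g01030.1:exon_2;Parent=LOC_Os01g01030.1",
    "Chr1 MSU_osa1r7 five_prime_UTR 12648 12773 . + . ID=LOC_Os01g01030.1:utr_1;Parent=LOC_Os01g01030.1",
    "Chr1 MSU_osa1r7 CDS 12774 13813 . + . ID=LOC_Os01g01030.1:cds_1;Parent=LOC_Os01g01030.1",
    "Chr1 MSU_osa1r7 CDS 13906 14271 . + . ID=LOC_Os01g01030.1:cds_2;Parent=LOC_Os01g01030.1",
]

result = collapse_entry(EXAMPLE)
for name, segments in result.items():
    print(name, segments)
```

Note that the first mRNA segment starts at 12648 while the first CDS segment starts at 12774: the exon already covers the 5' UTR, which is why the UTR lines add no information.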

CoGe requires valid values in columns 1, 3, 4, 5, 7, and 9 (seqid, type, start, end, strand, and attributes) and ignores the other columns. The attributes column must consist of key/value pairs delimited by semicolons, and every line must include an ID or Name attribute.
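
A file can be checked against these requirements before upload with a short Python script. The sketch below is a hypothetical pre-check, not CoGe's own validation: the names `parse_attributes` and `check_line` are made up here, and the definition of "valid" (non-empty seqid and type, integer start not greater than end, strand one of `+`, `-`, or `.`) is an assumption, since the page does not spell it out.

```python
VALID_STRANDS = {"+", "-", "."}  # assumption: values accepted for column 7


def parse_attributes(column9):
    """Split a semicolon-delimited attributes column into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"not a key=value pair: {pair!r}")
        attrs[key.strip()] = value.strip()
    return attrs


def check_line(line):
    """Return a list of problems found in one tab-separated GFF line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]
    # Columns 2, 6, and 8 (source, score, phase) are not checked.
    seqid, _source, ftype, start, end, _score, strand, _phase, attributes = cols
    problems = []
    if not seqid:
        problems.append("column 1 (seqid) is empty")
    if not ftype:
        problems.append("column 3 (type) is empty")
    if not (start.isdigit() and end.isdigit()):
        problems.append("columns 4 and 5 (start, end) must be integers")
    elif int(start) > int(end):
        problems.append("column 4 (start) is greater than column 5 (end)")
    if strand not in VALID_STRANDS:
        problems.append(f"column 7 (strand) has unexpected value {strand!r}")
    try:
        attrs = parse_attributes(attributes)
    except ValueError as err:
        problems.append(f"column 9 (attributes): {err}")
    else:
        if "ID" not in attrs and "Name" not in attrs:
            problems.append("column 9 (attributes) has neither ID nor Name")
    return problems


good = "\t".join(["Chr1", "MSU_osa1r7", "CDS", "12774", "13813", ".", "+", ".",
                  "ID=LOC_Os01g01030.1:cds_1;Parent=LOC_Os01g01030.1"])
bad = "\t".join(["Chr1", "MSU_osa1r7", "CDS", "12774", "13813", ".", "+", ".",
                 "Parent=LOC_Os01g01030.1"])
print(check_line(good))
print(check_line(bad))
```

Returning a list of problems rather than raising on the first one lets a script report every issue on a line in a single pass.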