Difference between revisions of "UGTs through the genus Brassica"

(Introduction)
(Methods)
Line 32: Line 32:
 
==Methods==
 
==Methods==
  
===Building a Phylogeny of Genes in the UGTs in ''Arabidopsis thaliana''===
+
===Python Code===
 +
A simple program called FindDifferences.py was developed to separate the A. thaliana genes found in both TAIR’s Glycosyltransferase Family 1 list and CoGe from those found only in the TAIR list. Any genes found only in CoGe were removed from the list of A. thaliana UGT genes. The program takes in two csv files: one containing the genes from TAIR and one containing the genes from CoGe. It reads each file, converts each gene name to a single format using strip functions, compares the two lists, and writes the gene names that appear in only one list to a new csv. [Insert GitHub link to program].
  
====CoGe BLAST====
+
===CoGe BLAST===
======TAIR======
+
The BLAST tool in CoGe uses a query sequence of amino acids to find genes or contigs with similar sequences in A. thaliana. The query sequence, AT5G65550, for the BLAST search used to construct the preliminary tree was chosen randomly from the Glycosyltransferase Family 1 genes of TAIR. This process was repeated several times. The same query sequence was used in BLAST for A. lyrata genes and B. rapa genes. A second amino acid query sequence, Bra037821, was taken from a B. rapa gene orthologous to the A. thaliana gene.
Glycosyltransferase Family 1 on The Arabidopsis Information Resource (TAIR) contained the annotations by The Institute for Genomic Research (TIGR) for flavonols and anthocyanidins, which contribute to plant pigmentation. CoGe BLAST was used to find sequences corresponding to those in TAIR. After a little coding, we could readily identify which of the more than one hundred genes came from the TAIR database and which came from CoGe. Information including genomic locus, TIGR annotation, and accession is stored in appropriately named csv files.
+
  
===Finding related UGT genes in ''Arabidopsis lyrata'' and ''Brassica rapa''===
+
===FASTA View===
The FASTA sequence for gene At5g65550 was used as a query sequence in the JGI Phytozome database to recover ''Arabidopsis lyrata'' genes. The Brassica Database (BRAD) was used to recover orthologs for the ''Brassica rapa'' genes.
+
The FASTA View feature in CoGe provides FASTA files for each gene entered. From CoGe BLAST, selected genes can be sent to FASTA View, where DNA and protein sequences can be retrieved. Of the 126 genes found in CoGe BLAST, 122 protein sequences were retrieved for the preliminary tree of A. thaliana. This pool of genes was used for the tree construction.
  
===Organizing the Data===
+
===Phylogeny.fr===
A table was developed to organize the information collected from each database (CoGe, Phytozome, BRAD). [[File:table.png]]
+
Phylogeny.fr is a tree-rendering program that takes FASTA files as input and outputs a phylogenetic tree. This program has been used to construct four trees so far. The preliminary tree of Arabidopsis thaliana genes consisted of 114 genes showing some interesting tandem duplications.
  
The table above tracks how the size of data changed along the process of collecting the FASTAs. The blue section of the table denotes information relative to the gene At5g65550 while the orange section of the table denotes information relative to the ''Brassica rapa'' ortholog to At5g65550, Bra037821.
+
===Test Groups===
 +
From the preliminary tree, three test groups were isolated by random sampling (see Table 2).
  
===Determining Test Groups===
+
===SynFind===
Of the 28 Glycosyltransferase Families that The Arabidopsis Information Resource (TAIR) lists for Arabidopsis thaliana, I chose to work with Family 1 because, as the TIGR Annotation suggested, most of its genes are similar in function. A preliminary tree was constructed from 122 of these genes (those for which CoGe had FASTA sequences). Using Keiko Yonekura-Sakakibara’s "Functional genomics of family 1 glycosyltransferases in Arabidopsis", I began identifying several functions of the genes. We identified three clusters on the tree that we named test groups. Test group 1 consists of AT2G36750, AT2G36760, AT2G36770, AT2G36780, AT2G36790, and AT2G36800. These genes exhibit flavonol 7-O-glucosyltransferase and brassinosteroid O-glucosyltransferase functions. Test group 2 consists of AT3G21780, AT4G15720, AT4G15260, AT4G15280, AT3G21750, and AT3G21760. These genes exhibit ABA glucosyltransferase activity. Test group 3 consists of AT4G01070, AT1G01420, AT1G01390, AT3G50740, AT5G66690, AT5G26310, AT2G18570, AT2G18560, and AT4G36670. These genes exhibit monolignol 4-O-glycosyltransferase activity and xenobiotic glycosyltransferase activity. The FASTAs for the genes in each test group were placed into one file per group. Using GEvo, each Arabidopsis thaliana gene was visualized for syntenic regions in Arabidopsis lyrata and Brassica rapa. The FASTAs for the syntenic genes were added to the file, and the GEvo links were saved to a separate file for each test group. Using phylogeny.fr, new trees were constructed with Arabidopsis thaliana, Arabidopsis lyrata, and Brassica rapa for each test group.
+
For each gene within each test group, the SynFind tool in CoGe was used to find syntenic regions among all three species, A. thaliana, A. lyrata, and B. rapa. The tool takes a gene from A. thaliana and finds regions of synteny in the selected target genomes. SynFind returns a table of gene matches with synteny scores. From this result, several tools and features can be accessed, such as GEvo, which was used in this instance.
  
===Expected Outcome===
+
===GEvo===
I expect Arabidopsis lyrata to have approximately the same number of UGT genes as Arabidopsis thaliana, with some polymorphisms and possibly some inversions. Brassica rapa, which has undergone a further triplication, will likely have approximately three times as many UGT genes as A. thaliana, with a significantly higher rate of loss than Arabidopsis lyrata. Some polymorphisms and inversions are also expected.
+
Two steps were taken to continue the research after finding the gene matches. The FASTA sequences were generated using FASTA View and saved to one file. In addition, the gene matches were compared and visualized in GEvo. The GEvo tool aligns the sequences of each gene match and shows regions of synteny with color-coded boxes. Visualizing in GEvo allows true matches to be confirmed and matches that are just noise to be removed. For the construction of the three new trees, no noise was removed. Links to each visualization were saved to a file.
In order to determine these ratios, a new phylogenetic tree will be constructed of the UGT genes in Arabidopsis thaliana, Arabidopsis lyrata, and Brassica rapa. Areas of interest within the tree will be further analyzed.
+

Revision as of 16:46, 14 July 2016

Introduction

The genus Brassica

The genus Brassica consists of over thirty wild species and hybrids or morphotypes. Many Brassica species are grown as food crops, such as broccoli, cauliflower, and cabbage.

The Brassica genome has undergone more polyploidy than Arabidopsis thaliana. Arabidopsis thaliana is notable for being a model organism because of its complexity paired with a relatively small genome.

Duplication Events

The Brassica genome has undergone two tetraploidy and two hexaploidy events, one more than Arabidopsis, since the eudicot paleohexaploidy event which gave rise to Vitis, Prunus, Arabidopsis, and Brassica.

Triangle of U

The "Triangle of U" theory describes the genetic relationship between six species of Brassica: Brassica rapa, Brassica nigra, Brassica oleracea, Brassica juncea, Brassica carinata, and Brassica napus. B. juncea, B. carinata and B. napus are allotetraploids, hybrids with four times the chromosome set of haploids.

Triangle of U sposato.png

UGT Gene Family

UGT functions

Uridine diphosphate (UDP) glycosyltransferases (UGTs) mediate transfer of glycosyl residues from activated nucleotide sugars to acceptor molecules (Tang, Unleashing the Genome of the Brassica rapa). UGT genes provide instructions for making enzymes that perform glucuronidation, a form of glycosylation in which glucuronic acid is added to a substrate (Genetics Home Reference, UGT gene family). This pathway is particularly important in metabolism, and many regard the UGT enzyme as the most important enzyme in the pathway. In humans, these enzymes are responsible for the breakdown of several prescription drugs (Glycosyltransferases). Plant UGTs are also known to contribute to cellular homeostasis and to the detoxification of xenobiotics by metabolizing the pollutants from pesticides.

UGT chemistry

By mediating transfer of glycosyl residues from activated nucleotide sugars to acceptor molecules, UGTs regulate properties of those acceptors such as bioactivity, solubility and transport within cells and throughout organisms (Ross, Higher plant glycosyltransferases). UGT enzymes carry out glucuronidation, a very common process in Phase II metabolism.

Purpose

UGTs are vital to metabolism of all organisms. Several UGT genes of Arabidopsis thaliana have been sequenced already. Looking into the sequences of Arabidopsis lyrata and Brassica rapa, I hope to determine the exact ratio of UGT genes in each of the species, discover similarities among phylogenetic trees, and pinpoint which genes were conserved, lost, and altered.

Preliminary Data

Hypotheses

According to how Arabidopsis lyrata and Brassica rapa diverged from Arabidopsis thaliana, and which duplication events occurred in that time, a ratio exists that would explain the UGT genes for each species if there were no losses of genes over time. That ratio looks something like: 1:1:3 of A. thaliana, A. lyrata, and B. rapa respectively. However, looking at the preliminary data we can access from CoGe, Phytozome, and BRAD, those ratios are not what we observe.
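The no-loss expectation can be made concrete with a short Python sketch. It scales the A. thaliana gene count by the 1:1:3 ratio and reports, for each species, the fraction of the expected genes actually observed. The function names are illustrative, and the counts in the example are hypothetical placeholders, not measurements from CoGe, Phytozome, or BRAD.

```python
# Expected UGT gene ratio if no genes were lost after the duplication events.
EXPECTED_RATIO = {"A. thaliana": 1, "A. lyrata": 1, "B. rapa": 3}

def expected_counts(thaliana_count):
    """Scale the A. thaliana gene count by the no-loss ratio."""
    return {species: thaliana_count * ratio
            for species, ratio in EXPECTED_RATIO.items()}

def retention(observed, thaliana_count):
    """Fraction of the no-loss expectation that is actually observed."""
    expected = expected_counts(thaliana_count)
    return {species: observed[species] / expected[species]
            for species in EXPECTED_RATIO}

# Hypothetical counts, used only to show the arithmetic.
hypothetical_observed = {"A. thaliana": 100, "A. lyrata": 90, "B. rapa": 210}
fractions = retention(hypothetical_observed,
                      hypothetical_observed["A. thaliana"])
for species, fraction in fractions.items():
    print(f"{species}: {fraction:.2f} of the no-loss expectation")
```

A fraction below 1 for a species would indicate gene loss relative to the 1:1:3 expectation; a fraction above 1 would point to additional duplications.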

Methods

Python Code

A simple program called FindDifferences.py was developed to separate the A. thaliana genes found in both TAIR’s Glycosyltransferase Family 1 list and CoGe from those found only in the TAIR list. Any genes found only in CoGe were removed from the list of A. thaliana UGT genes. The program takes in two csv files: one containing the genes from TAIR and one containing the genes from CoGe. It reads each file, converts each gene name to a single format using strip functions, compares the two lists, and writes the gene names that appear in only one list to a new csv. [Insert GitHub link to program].
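The comparison described above could look roughly like the following minimal sketch. It is not the actual FindDifferences.py: the function names, the assumption that gene names sit in the first column with no header row, and the normalization rule (trim whitespace, uppercase, drop a ".1"-style model suffix) are all illustrative assumptions. The locus AT1G99999 in the example is a made-up placeholder.

```python
import csv
import io

def normalize(name):
    """Reduce a gene name to a bare locus ID, e.g. ' at5g65550.1 ' -> 'AT5G65550'."""
    return name.strip().upper().split(".")[0]

def read_gene_names(handle, column=0):
    """Collect normalized gene names from one column of an open CSV file."""
    names = set()
    for row in csv.reader(handle):
        if len(row) > column and row[column].strip():
            names.add(normalize(row[column]))
    return names

def find_differences(tair_names, coge_names):
    """Split names into those in both lists, only in TAIR, and only in CoGe."""
    return {
        "both": sorted(tair_names & coge_names),
        "tair_only": sorted(tair_names - coge_names),
        "coge_only": sorted(coge_names - tair_names),
    }

def write_names(names, handle):
    """Write one gene name per row to an open CSV file."""
    writer = csv.writer(handle)
    for name in names:
        writer.writerow([name])

# Small in-memory example; real use would open the TAIR and CoGe csv files.
tair_csv = io.StringIO("AT5G65550\nAT2G36750 \nat4g01070\n")
coge_csv = io.StringIO("At5g65550.1\nAT2G36750.1\nAT1G99999.1\n")
result = find_differences(read_gene_names(tair_csv), read_gene_names(coge_csv))

out = io.StringIO()
write_names(result["tair_only"], out)  # names found in TAIR but not in CoGe
```

Using sets makes the comparison independent of row order and of repeated entries within a file; the "coge_only" names correspond to the genes that were removed from the A. thaliana UGT list.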

CoGe BLAST

The BLAST tool in CoGe uses a query sequence of amino acids to find genes or contigs with similar sequences in A. thaliana. The query sequence, AT5G65550, for the BLAST search used to construct the preliminary tree was chosen randomly from the Glycosyltransferase Family 1 genes of TAIR. This process was repeated several times. The same query sequence was used in BLAST for A. lyrata genes and B. rapa genes. A second amino acid query sequence, Bra037821, was taken from a B. rapa gene orthologous to the A. thaliana gene.

FASTA View

The FASTA View feature in CoGe provides FASTA files for each gene entered. From CoGe BLAST, selected genes can be sent to FASTA View, where DNA and protein sequences can be retrieved. Of the 126 genes found in CoGe BLAST, 122 protein sequences were retrieved for the preliminary tree of A. thaliana. This pool of genes was used for the tree construction.

Phylogeny.fr

Phylogeny.fr is a tree-rendering program that takes FASTA files as input and outputs a phylogenetic tree. This program has been used to construct four trees so far. The preliminary tree of Arabidopsis thaliana genes consisted of 114 genes showing some interesting tandem duplications.

Test Groups

From the preliminary tree, three test groups were isolated by random sampling (see Table 2).

SynFind

For each gene within each test group, the SynFind tool in CoGe was used to find syntenic regions among all three species, A. thaliana, A. lyrata, and B. rapa. The tool takes a gene from A. thaliana and finds regions of synteny in the selected target genomes. SynFind returns a table of gene matches with synteny scores. From this result, several tools and features can be accessed, such as GEvo, which was used in this instance.

GEvo

Two steps were taken to continue the research after finding the gene matches. The FASTA sequences were generated using FASTA View and saved to one file. In addition, the gene matches were compared and visualized in GEvo. The GEvo tool aligns the sequences of each gene match and shows regions of synteny with color-coded boxes. Visualizing in GEvo allows true matches to be confirmed and matches that are just noise to be removed. For the construction of the three new trees, no noise was removed. Links to each visualization were saved to a file.
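Saving the FASTA sequences for a test group to one file can be sketched in Python as below. This is an assumption about how the merge might be scripted, not a description of what FASTA View does: the function names are illustrative, duplicate records are detected by identical header lines, and the short sequences in the example are placeholders rather than real protein sequences.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def merge_fasta(texts):
    """Combine several FASTA texts into one, keeping the first copy of each header."""
    records = {}
    for text in texts:
        for header, sequence in parse_fasta(text):
            records.setdefault(header, sequence)
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records.items())

# Placeholder sequences, used only to show the merge.
thaliana = ">AT2G36750\nMAAA\nLLL\n>AT2G36760\nMCCC\n"
matches = ">AT2G36760\nMCCC\n"
merged = merge_fasta([thaliana, matches])
print(merged)
```

The merged text can then be written to a single file and submitted as the input for tree construction.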