Overview

Recombinational events such as translocation, inversion or segmental duplication can create accidental fusion of DNA sequences associated with different genes, or conversely the fission of a gene into several parts. Potentially, these events can create new genes from already existing parts, or reciprocally shuffle genes into sub-parts across a genome. These rare events participate in the evolutionary history of the species, and must be taken into account in genome rearrangement models. Gene fusion and fission events are key mechanisms in the evolution of gene architecture, whose effects are visible in protein architecture when they occur in coding sequences.


We focused our study on 12 species spanning the fungi, for which a number of complete or nearly complete genomes are currently available, especially among the hemiascomycetes. Because the evolutionary distances between these genomes are large, even within the hemiascomycetes, non-coding sequences have diverged too much to search for fusion events in them. Since our study is therefore restricted to coding sequences, we used complete proteomes to track fusion and fission events.



Formalization

Definition: Let P1 be a proteome and A be the set of alignments of proteins from P1 against the proteins of a proteome P2. We say that a subset A' of A is the set of filtered alignments if every pair of proteins (a,b) in A' satisfies some filtering predicate F.

Note that P1 and P2 may represent the same proteome.
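
The filtering step can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the Alignment record, its fields and the thresholds in example_predicate are all hypothetical, since the text above leaves the predicate F unspecified.

```python
from typing import Callable, Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """One pairwise alignment. All field names are hypothetical."""
    query: str       # protein a from proteome P1
    subject: str     # protein b from proteome P2
    evalue: float    # alignment E-value
    identity: float  # percent identity over the aligned region


def filter_alignments(alignments: Iterable[Alignment],
                      predicate: Callable[[Alignment], bool]) -> List[Alignment]:
    """Return A', the alignments of A that satisfy the predicate F."""
    return [aln for aln in alignments if predicate(aln)]


def example_predicate(aln: Alignment) -> bool:
    """Placeholder F: the thresholds are assumptions, not values from the study."""
    return aln.evalue <= 1e-5 and aln.identity >= 30.0
```

Passing the predicate as a parameter keeps the definition generic: the same function works whether P1 and P2 are different proteomes or the same one.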


Definition: A paralog group (P-group) G is a set of protein sequences from the same complete proteome such that

1. for any protein p from G there exists a protein q in G such that F(p,q) is true,
2. the length of the shorter of p and q is at least 70% of the length of the longer.

Note that a P-group may contain only one sequence.
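
One possible way to compute such groups is sketched below. It assumes that P-groups are the connected components of the graph whose edges are the protein pairs satisfying both conditions; the definition above does not state this explicitly, so the union-find approach and the function names are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def length_ratio_ok(len_p: int, len_q: int, min_ratio: float = 0.70) -> bool:
    """Condition 2: the shorter sequence is at least 70% of the longer."""
    return min(len_p, len_q) >= min_ratio * max(len_p, len_q)


def build_p_groups(lengths: Dict[str, int],
                   filtered_pairs: Iterable[Tuple[str, str]],
                   min_ratio: float = 0.70) -> List[Set[str]]:
    """Group the proteins of one proteome into P-groups.

    lengths maps each protein name to its sequence length;
    filtered_pairs lists the pairs (p, q) for which F(p, q) is true.
    Assumption: groups are connected components of the pair graph.
    """
    parent = {p: p for p in lengths}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, q in filtered_pairs:
        if p != q and length_ratio_ok(lengths[p], lengths[q], min_ratio):
            parent[find(p)] = find(q)

    groups: Dict[str, Set[str]] = {}
    for p in lengths:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())
```

Proteins with no qualifying partner end up alone in their group, which matches the remark that a P-group may contain only one sequence.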


Definition: A gene fusion/fission event is a combination of at least three P-groups, namely one composite group (C-group) C and several element groups (E-groups) E1, E2, ..., that contain at least three sequences c in C, e1 in E1, e2 in E2, ... such that


1. the ei belong to the same proteome,
2. the ei align with c,
3. the alignments of the ei on c do not significantly overlap (generally, they overlap by less than 10%).
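
The three conditions can be checked with a short sketch like the one below. The Hit record is hypothetical, and the overlap measure (shared positions on c divided by the length of the shorter of the two aligned segments) is an assumption: the definition gives the 10% threshold but not the exact formula.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Alignment of one element protein e_i on the composite c (hypothetical record)."""
    element: str   # name of e_i
    proteome: str  # proteome that e_i belongs to
    start: int     # first aligned position on c (1-based, inclusive)
    end: int       # last aligned position on c (inclusive)


def overlap_fraction(a: Hit, b: Hit) -> float:
    """Shared positions on c, relative to the shorter aligned segment (assumed measure)."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return shared / shorter


def is_candidate_event(hits: List[Hit], max_overlap: float = 0.10) -> bool:
    """Check conditions 1 and 3 for the elements aligned with one composite c.

    Condition 2 is implied: every Hit is an alignment of some e_i with c.
    """
    if len({h.element for h in hits}) < 2:
        return False  # at least two distinct elements are needed
    if len({h.proteome for h in hits}) != 1:
        return False  # condition 1: same proteome
    return all(overlap_fraction(a, b) < max_overlap
               for i, a in enumerate(hits) for b in hits[i + 1:])
```

The sketch works on individual sequences only; checking that the elements come from distinct E-groups would need the P-group assignment as an extra input.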

Proteomes searched

Phylum         Sub-phylum       Species                    Database
Ascomycota     Hemiascomycota   Saccharomyces cerevisiae   SGD
Ascomycota     Hemiascomycota   Candida glabrata           Génolevures
Ascomycota     Hemiascomycota   Kluyveromyces lactis       Génolevures
Ascomycota     Hemiascomycota   Eremothecium gossypii      AGD
Ascomycota     Hemiascomycota   Candida albicans           CandidaDB
Ascomycota     Hemiascomycota   Debaryomyces hansenii      Génolevures
Ascomycota     Hemiascomycota   Yarrowia lipolytica        Génolevures
Ascomycota     Euascomycota     Neurospora crassa          Broad Institute
Ascomycota     Euascomycota     Aspergillus nidulans       Broad Institute
Ascomycota     Archeascomycota  Schizosaccharomyces pombe  Wellcome Trust Sanger Institute
Basidiomycota  -                Cryptococcus neoformans    Stanford Genome Technology Center
Zygomycota     -                Rhizopus oryzae            Broad Institute

Results and Data  

Download files

Date        Release  View
2007/11/02  1.1      GroupComposition.txt
2008/02/06  1.1      Events.txt

  • GroupComposition.txt: Table with one row per protein indicating its P-group and protein name.
    P-Group protein_name
    Data are tab separated. P-group name is made of an acronym and a number. The acronym is built from the first two letters of the genus followed by the first two letters of the species.
  • Events.txt:
[ Event_name , type = type_number
tab Group_name
     :
     :
tab Merge_name ( List_of_Group_names )
     if necessary
     :
     :
tab Group_name=Group_name
     Group-Group relation
     :
     :
tab Merge_name=Group_name
     Merge-Group relation
]

Type numbers: 1 Undecidable; 2 Fusion; 3 Fission; 4 Multiple.
A relation is always written with the E-group on the left side and the C-group on the right.
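
The two file layouts described above can be read with a sketch like the following. It is not an official parser: the separator inside a merge list (commas or whitespace), the letter case of the acronym, and all function names are assumptions, and the group and event names used in the usage note are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

EVENT_TYPES = {1: "Undecidable", 2: "Fusion", 3: "Fission", 4: "Multiple"}

# "[ Event_name , type = type_number"
HEADER = re.compile(r"^\[\s*(\S+)\s*,\s*type\s*=\s*(\d+)\s*$")
# "Merge_name ( List_of_Group_names )"
MERGE = re.compile(r"^(\S+)\s*\(\s*(.*?)\s*\)$")


def species_acronym(genus: str, species: str) -> str:
    """First two letters of the genus, then of the species (case is an assumption)."""
    return genus[:2] + species[:2]


def read_group_composition(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each P-group name to its proteins (one tab-separated row per protein)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if line.strip():
            group, protein = line.rstrip("\n").split("\t")[:2]
            groups[group].append(protein)
    return dict(groups)


def parse_events(lines: Iterable[str]) -> List[dict]:
    """Parse bracketed event records into dictionaries."""
    events, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:  # opening line of a new event
            current = {"name": header.group(1),
                       "type": EVENT_TYPES.get(int(header.group(2)), "Unknown"),
                       "groups": [], "merges": {}, "relations": []}
        elif line == "]":  # closing line of the event
            if current is not None:
                events.append(current)
            current = None
        elif current is None:
            continue  # ignore anything outside an event record
        elif "=" in line:  # relation: E-group (or merge) on the left, C-group on the right
            left, right = (part.strip() for part in line.split("=", 1))
            current["relations"].append((left, right))
        else:
            merge = MERGE.match(line)
            if merge:  # merge of several groups under one name
                members = [g for g in re.split(r"[,\s]+", merge.group(2)) if g]
                current["merges"][merge.group(1)] = members
            else:  # plain group name
                current["groups"].append(line)
    return events
```

For example, parse_events applied to the lines "[ Ev1 , type = 2", "\tSace12", "\tSace40", "\tSace12=Cagl7" and "]" would yield a single event named Ev1 of type Fusion with two groups and one relation.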

References