The proteomes of the nine completely sequenced yeast species are classified into families based on all-to-all sequence comparisons and algorithmic consensus clustering, as described in (Nikolski and Sherman, 2007). The raw alignments used in this computation are systematic Smith-Waterman and Blast alignments, both homeomorphic [sharing full-length sequence similarity and similar domain architectures (see Wu et al., 2004)] and nonhomeomorphic. The computed families were systematically compared to external data, namely PIR-SF and PIR-CF families (Wu et al., 2004), Génolevures 2 (GL2) families and Génolevures 3 (GL3) curator-defined homolog groups. The best consensus was chosen using the criteria of coverage (of GL3, GL2 and PIR-SF) and quality metrics internal to the consensus algorithm. These families were further classified into two categories: robust families were found using all combinations of statistical parameters and are the most reliable, and consensus families were found using a combination of parameters evaluated using a Condorcet election procedure.
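To make the idea of a Condorcet election concrete, here is a minimal sketch of a generic Condorcet winner computation. It is not the procedure of Nikolski and Sherman (2007): the candidate names (`params_A`, etc.) and the ballots are hypothetical, and each ballot is simply assumed to be a full ranking of candidate parameter combinations, best first.

```python
from itertools import combinations


def condorcet_winner(ballots):
    """Return the candidate that beats every other candidate in
    head-to-head majority contests, or None if no such candidate exists.

    Each ballot is a list ranking all candidates, best first.
    """
    candidates = sorted(set(ballots[0]))
    # beaten[c] collects the candidates that c defeats pairwise.
    beaten = {c: set() for c in candidates}
    for a, b in combinations(candidates, 2):
        a_votes = sum(1 for ballot in ballots if ballot.index(a) < ballot.index(b))
        b_votes = len(ballots) - a_votes
        if a_votes > b_votes:
            beaten[a].add(b)
        elif b_votes > a_votes:
            beaten[b].add(a)
    for c in candidates:
        if len(beaten[c]) == len(candidates) - 1:
            return c
    return None


# Hypothetical ballots: each "voter" ranks three parameter combinations.
ballots = [
    ["params_A", "params_B", "params_C"],
    ["params_B", "params_A", "params_C"],
    ["params_A", "params_C", "params_B"],
]
winner = condorcet_winner(ballots)

# A cyclic set of preferences has no Condorcet winner.
cyclic = [
    ["params_A", "params_B", "params_C"],
    ["params_B", "params_C", "params_A"],
    ["params_C", "params_A", "params_B"],
]
no_winner = condorcet_winner(cyclic)
```

A winner need not exist, which is one reason a real procedure may need a tie-breaking rule or manual curation.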
Four types of protein families are defined:
- Robust families GL3R.* were found using all combinations of statistical parameters and are the most reliable.
- Consensus families GL3C.* were found using a combination of parameters evaluated using a Condorcet election procedure, and in some cases manual curation. They often represent a merge of subfamilies.
- Multiple choice families GL3M.* have a highly variable composition that depends on the statistical parameters. Many of them concern notoriously complicated families such as polyproteins and repeat domains.
- Unique families GL3U.* correspond to singletons, i.e. one protein per family.
Family identifiers are arbitrary. Each family is associated with a phyletic pattern indicating, for each species, the presence or absence of a protein from that species.
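As an illustration, the family type and the phyletic pattern might be derived as in the sketch below. The species codes (`sp1` to `sp4`), the protein names, the identifier suffixes and the 0/1 string encoding are all assumptions made for this example; the encoding actually used in the distributed files is not specified here.

```python
# Hypothetical species codes, in a fixed order that defines the pattern columns.
SPECIES = ["sp1", "sp2", "sp3", "sp4"]

# Family type prefixes as listed above.
FAMILY_TYPES = {
    "GL3R": "robust",
    "GL3C": "consensus",
    "GL3M": "multiple choice",
    "GL3U": "unique",
}


def family_type(family_id):
    """Map an identifier such as 'GL3R.0001' (suffix hypothetical) to its type."""
    prefix = family_id.split(".", 1)[0]
    return FAMILY_TYPES.get(prefix, "unknown")


def phyletic_pattern(members, species_order=SPECIES):
    """Return one character per species: '1' if the family contains at
    least one protein from that species, '0' otherwise.

    `members` is a list of (species, protein) pairs.
    """
    present = {species for species, _protein in members}
    return "".join("1" if s in present else "0" for s in species_order)


# Hypothetical family with proteins from two of the four species.
family = [("sp1", "protA"), ("sp3", "protB"), ("sp3", "protC")]
pattern = phyletic_pattern(family)
kind = family_type("GL3C.0042")
```

Note that the pattern records presence or absence only, so two proteins from the same species contribute a single `1`.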
- byfamily.txt: Table with one row per family, giving its pattern, its profile, and the names of the genes coding for its proteins.
- byprotein.txt: Table with one row per protein, giving its family.
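A file such as byprotein.txt could be loaded and regrouped by family along the following lines. This is only a sketch: the exact layout is an assumption (tab-separated, protein name in the first column, family identifier in the second, optional `#` comment lines), and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def families_from_byprotein(handle, protein_col=0, family_col=1):
    """Group protein names by family identifier.

    Assumes tab-separated rows; the column positions are hypothetical
    defaults and should be adjusted to the actual file layout.
    """
    families = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        families[row[family_col]].append(row[protein_col])
    return dict(families)


# Invented sample standing in for the real file contents.
sample = io.StringIO(
    "# protein\tfamily\n"
    "protA\tGL3R.0001\n"
    "protB\tGL3R.0001\n"
    "protC\tGL3U.0002\n"
)
families = families_from_byprotein(sample)
```

With a real file, the `io.StringIO` object would be replaced by `open("byprotein.txt", newline="")`.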
- Family relationships: should consensus reign? Consensus clustering for protein families
Nikolski M, Sherman DJ
Bioinformatics, 23(2):e71-e76, 2007
- PIRSF: family classification system at the Protein Information Resource
Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P, Ledley RS, Suzek BE, Arminski L, Chen T, Zhang J, Cardenas JL, Chung S, Castro-Alvear J, Dinkov G, Barker WC
Nucleic Acids Res, 32:D112-D114, 2004