Gonna

Gonna is a Guilt By Association (GBA) predictor. Unlike sequence homology, which relies on inter-species annotation transfer (genes characterized in other species are used to annotate genes of the newly sequenced genome), GBA approaches rely on intra-species annotation transfer: genes already characterized in the genome, e.g. by wet-lab experiments or by sequence homology, are used to annotate the other genes (the guilt by association principle). Gene expression data are often used, since genes with similar transcriptomic profiles likely share common functional roles [4][7]. Likewise, protein interaction data are used, since proteins that share common interactors likely share common functions [1][8][2]. These methods provide functional predictions for uncharacterized genes, and new clues to be compared with the predictions obtained by homology.

Principle

Gonna uses a k-nearest neighbor approach [5]. It takes as input two positive integers K and K' <= K (e.g. K=6 and K'=4), one ontology (MF, BP, or CC), and one postgenomic data source D (e.g. the microarray data of [6]). From this data source, Gonna computes a function SD that measures the similarity SD(g,h) of every pair of genes (g,h). For example, if D is a transcriptomic dataset, then SD measures the similarity of expression profiles using the Pearson correlation coefficient. When asked for the GO categories of a gene g, Gonna uses SD to find, among the genes already characterized in the selected ontology by GeneDB, the K genes that are most similar to g. Then, for each GO term t of the ontology, Gonna looks at these K genes: if at least K' of them are associated with t, then g is predicted to be associated with t as well; otherwise g is not predicted to be in t.
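The voting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Gonna code: the function names (pearson, predict_terms), the dictionary-based data layout, and the toy genes and terms are all hypothetical, and a flat profile is arbitrarily treated as uncorrelated.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / sqrt(vx * vy)

def predict_terms(gene, profiles, annotations, k=6, k_prime=4, similarity=pearson):
    """Return the set of GO terms predicted for `gene`.

    profiles:    dict gene -> list of measurements (the data source D)
    annotations: dict gene -> set of GO terms of the characterized genes
    """
    # Candidate neighbors: characterized genes other than the query gene.
    candidates = [h for h in annotations if h != gene and h in profiles]
    # Keep the K candidates most similar to the query gene.
    ranked = sorted(
        candidates,
        key=lambda h: similarity(profiles[gene], profiles[h]),
        reverse=True,
    )
    neighbors = ranked[:k]
    # Count, for each term, how many of the K neighbors carry it.
    votes = {}
    for h in neighbors:
        for t in annotations[h]:
            votes[t] = votes.get(t, 0) + 1
    # A term is predicted when at least K' neighbors carry it.
    return {t for t, n in votes.items() if n >= k_prime}

# Toy data (hypothetical genes and terms), with K=3 and K'=2.
profiles = {
    "g": [1, 2, 3, 4],
    "a": [2, 4, 6, 8],
    "b": [1, 2, 3, 5],
    "c": [0, 1, 2, 3],
    "d": [4, 3, 2, 1],
}
annotations = {"a": {"T1", "T2"}, "b": {"T1"}, "c": {"T1", "T2"}, "d": {"T3"}}
print(sorted(predict_terms("g", profiles, annotations, k=3, k_prime=2)))
```

Passing the similarity as a parameter mirrors the idea that the same voting rule applies to any data source for which a relevant similarity measure exists.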

Critical choices

Some choices are critical to ensure that Gonna provides relevant and accurate predictions. The first critical choice is the similarity measure, which has to capture the "signature" of the gene functions in the dataset at hand: when two genes appear to be similar, this should imply that they share common functions. For transcriptomic (microarray) and proteomic (mass-spectrometry) data, we use the Pearson correlation coefficient, which gives high similarity to genes with correlated transcriptomic/proteomic profiles. For protein-protein interaction data, we use the Czekanowski-Dice metric [3][1], under which pairs of genes that share many interactors are highly similar.
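For the interaction data, a similarity in the spirit of the Czekanowski-Dice metric can be sketched as follows. The formula used here, the size of the symmetric difference of the two interactor sets divided by the sum of the sizes of their union and intersection, is a common formulation of this distance; whether each protein is counted among its own interactors, and how proteins without any known interactor are handled, are assumptions of this sketch rather than documented choices of Gonna. The function names are hypothetical.

```python
def czekanowski_dice_distance(interactors_a, interactors_b):
    """Distance in [0, 1] between two sets of interactors.

    0 means identical interactor sets, 1 means no shared interactor.
    """
    a, b = set(interactors_a), set(interactors_b)
    union = a | b
    if not union:
        # Assumption: proteins with no known interactors are maximally distant.
        return 1.0
    shared = a & b
    # |A symmetric-difference B| / (|A union B| + |A intersection B|)
    return len(a ^ b) / (len(union) + len(shared))

def czekanowski_dice_similarity(interactors_a, interactors_b):
    """Similarity derived from the distance: high when many interactors are shared."""
    return 1.0 - czekanowski_dice_distance(interactors_a, interactors_b)

# Hypothetical interactor sets: two of the four distinct partners are shared.
print(czekanowski_dice_similarity({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))
```

With this formulation the similarity equals twice the number of shared interactors divided by the sum of the two set sizes, so it grows with the number of common interactors, as required above.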

Another critical choice is the pair of values K and K'. K should be neither too large (or else some neighbors will not be similar to the studied gene) nor too small (to avoid reduced, non-representative gene samples). With K' the problem is different. If K' is high (close to K), then the proportion of good predictions is likely high, but only a few predictions can be made on the most specific terms of the ontology, and most of the predictions involve the most general (and hence less interesting) terms. Conversely, if K' is low, then the proportion of good predictions declines, but more predictions are made on the most specific terms. In PlasmoDraft, we use two pairs of parameters (K,K') for each postgenomic data source. A stringent pair (K=6,K'=4) yields, for each GO term, a first set of predictions that usually has a high proportion of good predictions (see next section for an estimate of this proportion). A second, non-stringent pair (K=6,K'=2) then yields, for each GO term, another set of predictions that cannot be obtained with the stringent setting, but which usually contains a lower proportion of good predictions.
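Since both settings share the same K, the two prediction sets can be derived from a single count of neighbor votes. The sketch below illustrates this for one gene; the function name split_predictions, the vote dictionary, and the term names are hypothetical, and the thresholds default to the K'=4 and K'=2 values given above.

```python
def split_predictions(votes, k_prime_stringent=4, k_prime_relaxed=2):
    """Split the terms of one gene into stringent and non-stringent predictions.

    votes: dict GO term -> number of the K nearest neighbors carrying the term
    Returns (stringent, relaxed_only), where relaxed_only holds the terms
    reached with the non-stringent threshold but not with the stringent one.
    """
    stringent = {t for t, n in votes.items() if n >= k_prime_stringent}
    relaxed_only = {
        t for t, n in votes.items()
        if k_prime_relaxed <= n < k_prime_stringent
    }
    return stringent, relaxed_only

# Hypothetical vote counts among K=6 neighbors.
votes = {"general_term": 5, "specific_term": 3, "rare_term": 1}
stringent, relaxed_only = split_predictions(votes)
print(sorted(stringent), sorted(relaxed_only))
```

In this toy case the general term passes the stringent threshold, the more specific term is only reached with the non-stringent one, and the term carried by a single neighbor is not predicted at all.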

Some properties of Gonna

This k-nearest neighbor predictor has several appealing features. First, it is a direct and simple implementation of the GBA principle, which allows each prediction to be explained by exhibiting the K' (or more) neighboring genes annotated by GeneDB that support it. Second, Gonna can be used with any present or future postgenomic data source, as long as a relevant similarity measure is available. Third, Gonna is consistent with the structure of the ontology: if a gene is predicted in a GO term t, then it is also predicted in all terms that generalize t. Finally, Gonna has a low computing time, which enables intensive use of the cross-validation procedure to assess the confidence of the predictions.
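One way to see why the consistency property can hold is to assume that the annotations of the characterized genes are propagated upward in the ontology, so that a gene annotated with t is also annotated with every term that generalizes t. Under this assumption, any K' neighbors that support t also support its ancestors. The sketch below illustrates the propagation and a consistency check; the parents mapping, the term names, and the function names are hypothetical, and the text above does not state that Gonna proceeds this way.

```python
def ancestors(term, parents):
    """All terms that generalize `term`, following the parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def propagate(terms, parents):
    """Close a set of annotations under generalization."""
    closed = set(terms)
    for t in terms:
        closed |= ancestors(t, parents)
    return closed

def is_consistent(predicted, parents):
    """True if every predicted term comes with all the terms that generalize it."""
    return all(ancestors(t, parents) <= set(predicted) for t in predicted)

# Hypothetical three-level ontology fragment: child -> mid -> root.
parents = {"child": ["mid"], "mid": ["root"]}
print(sorted(propagate({"child"}, parents)))
print(is_consistent({"mid", "root"}, parents))
```

A check such as is_consistent can also serve as a sanity test on the output of any predictor that is expected to respect the ontology structure.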


brehelin at lirmm.fr