Assessing the predictions

Cross-validation (CV) is a well-known procedure for estimating the error rate of any supervised classification method [5]. It involves (1) running Gonna on each gene already characterized in GeneDB as if it were an uncharacterized gene, and (2) comparing the predictions to the true annotations. Since no functional information on these genes is supplied to Gonna for the predictions, this procedure provides an unbiased estimate of the method's performance [5]. For a given GO term t, the correct predictions in the CV are the genes predicted in t that are already annotated with this term in GeneDB; the wrong predictions are the genes predicted in t that are already annotated in the ontology under consideration (MF, BP, or CC) but not with t. Genes without any annotation in the selected ontology are not taken into account. It is convenient to present these quantities in tabular form:

                        predicted in t    not predicted in t
annotated with t              pa                  na
not annotated with t          pn                  nn

For example, pa denotes the number of genes predicted in the GO term t which are annotated with t in GeneDB, while nn denotes the number of genes not annotated with t that have not been predicted in t in the CV.
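
To make these definitions concrete, the four counts for a term could be tallied as in the following Python sketch. The data structures (cv_predictions, genedb_annotations, genes_with_ontology_annotation) are hypothetical illustrations, not actual Gonna or PlasmoDraft objects.

    def count_table(term, cv_predictions, genedb_annotations,
                    genes_with_ontology_annotation):
        """Return (pa, na, pn, nn) for GO term `term`.

        cv_predictions[gene]           -- GO terms predicted for `gene` in the CV
        genedb_annotations[gene]       -- GO terms annotating `gene` in GeneDB
        genes_with_ontology_annotation -- genes with at least one annotation in
                                          the ontology under consideration (MF,
                                          BP or CC); other genes are ignored
        """
        pa = na = pn = nn = 0
        for gene in genes_with_ontology_annotation:
            predicted = term in cv_predictions.get(gene, set())
            annotated = term in genedb_annotations.get(gene, set())
            if annotated and predicted:
                pa += 1          # correct prediction
            elif annotated and not predicted:
                na += 1          # annotated gene missed by the predictor
            elif predicted:
                pn += 1          # wrong prediction
            else:
                nn += 1          # neither annotated nor predicted
        return pa, na, pn, nn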

TDR

The True Discovery Rate (TDR) associated with a GO term (and for a given data source) is then estimated by

TDR = pa/(pa+pn).

For example, a GO term with a TDR of 80% means that when Gonna predicts that a gene belongs to this term, the prediction has an 80% chance of being correct. Note that, due to the incompleteness of the annotations, the above formula may underestimate the true TDR, because some predictions counted as wrong may actually be correct.
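
Expressed in code, the estimate is simply the fraction of correct predictions among all predictions made for the term (a minimal sketch; the function name is ours):

    def tdr(pa, pn):
        """True Discovery Rate of a GO term: pa / (pa + pn)."""
        if pa + pn == 0:
            return None          # the term was never predicted, TDR undefined
        return pa / (pa + pn)

    # e.g. 8 correct predictions out of 10 give a TDR of 80%
    assert tdr(8, 2) == 0.8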

Two sets of predictions are produced for each GO term and data source, using two values of the parameter K' (see Gonna). One TDR is therefore estimated for each of these sets: the first TDR reports the accuracy of the predictions obtained with the stringent predictor, while the second TDR reports the accuracy of the predictions obtained with the non-stringent predictor but not supported by the stringent one. As expected, the first TDR is usually higher than the second.
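
As an illustration, the sketch below assumes two hypothetical gene sets (one per value of K') plus the set of genes annotated with the term, and computes one TDR for the stringent predictor and one for the non-stringent predictions not supported by the stringent one; these inputs are assumptions for the example, not actual Gonna data structures.

    def two_tdrs(stringent_pred, non_stringent_pred, annotated_with_term, considered):
        """Return the two TDRs reported for a GO term.

        stringent_pred, non_stringent_pred -- genes predicted in the term with the
                                              stringent and non-stringent K' values
        annotated_with_term                -- genes annotated with the term in GeneDB
        considered                         -- genes with at least one annotation in
                                              the selected ontology
        """
        def tdr(predicted):
            predicted = predicted & considered
            if not predicted:
                return None
            return len(predicted & annotated_with_term) / len(predicted)

        first = tdr(stringent_pred)                         # stringent predictions
        second = tdr(non_stringent_pred - stringent_pred)   # non-stringent only
        return first, second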

The interest of estimating a TDR for each GO term, rather than a single global performance measure over the whole ontology, is that it makes it possible to identify the GO terms that are more suitable for a GBA approach with the considered data source. Indeed, not all GO terms can be predicted with the same accuracy: first, because some terms are more general than others and are thus a priori more likely; second, because some functions (GO terms) have a more apparent signature than others in the type of data considered.

Non-significant TDRs

When the sample size (pa+pn) is small, the TDR estimate may not be fully accurate. We therefore also compute the p-value of achieving by chance pa or more correct predictions (among pa+pn) if the true TDR were equal to the prior probability of the term. If this p-value is higher than 5%, the TDR is not considered significantly higher than the prior probability.
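
The test described above corresponds to the upper tail of a binomial distribution with pa+pn trials and a success probability equal to the term's prior probability; this reading is an assumption on our part, sketched here with SciPy:

    from scipy.stats import binom

    def tdr_pvalue(pa, pn, prior):
        """P(X >= pa) with X ~ Binomial(pa + pn, prior)."""
        return binom.sf(pa - 1, pa + pn, prior)

    # Example: 4 correct predictions out of 5 for a term whose prior is 10%
    p = tdr_pvalue(4, 1, 0.10)
    significant = p <= 0.05      # TDR deemed significant only if p-value <= 5%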

PlasmoDraft reports the TDRs of the predictions with a color code that ranges from red (0%) to light green (100%) via yellow (50%), while non-significant TDRs appear in gray.
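
Purely as an illustration of this color scale, a linear interpolation between the three anchor colors could look like the sketch below; the exact RGB values used by PlasmoDraft are not documented here, so the ones shown are placeholders.

    def tdr_color(tdr_percent, significant):
        """Map a TDR (in %) to an RGB triple: red -> yellow -> light green."""
        if not significant:
            return (190, 190, 190)                       # gray
        red, yellow, green = (255, 0, 0), (255, 255, 0), (144, 238, 144)
        if tdr_percent <= 50:
            lo, hi, frac = red, yellow, tdr_percent / 50.0
        else:
            lo, hi, frac = yellow, green, (tdr_percent - 50) / 50.0
        return tuple(round(a + frac * (b - a)) for a, b in zip(lo, hi))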

