Combining the data sources

When each data source has been used to produce predictions, and TDRs have been estimated for each GO term and each source, Gonna combines all of these results to propose a Global Degree of Belief (GDB) for each prediction.

If gene g has been predicted to be associated with GO term t by one or several sources, Gonna computes the GDB of this prediction as follows. First, we compute a global confidence score that combines the TDRs of term t for each data source that predicts t with the fact that some other sources do not support this prediction (see below). This score takes values between 0 (poor confidence) and 1 (high confidence). For each GO term, four score categories are considered (very low [0.0,0.25], low ]0.25,0.5], high ]0.5,0.75], and very high ]0.75,1.0]), and the TDR associated with each category is estimated by a second cross-validation (CV) procedure: we compute the proportion of successes among already characterized genes that have been predicted in this GO term with a confidence score in this category. These TDRs are then used to estimate the GDB of each prediction. For example, a prediction with a GDB of 80% means that 80% of the predictions belonging to the same GO term and the same score category are correct in the second CV procedure. As for the TDRs, we also compute a p-value for each GDB. If this p-value is higher than 5%, the GDB is not considered to be significantly higher than the prior probability of the term, and it appears in gray in PlasmoDraft.
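The category-based estimation described above can be illustrated with a minimal Python sketch. It is not Gonna's actual code: the function names and the input format (a list of score/outcome pairs collected for one GO term during the second CV) are assumptions made for illustration.

```python
from collections import defaultdict

# Upper bounds of the four right-closed score categories given in the text.
CATEGORIES = [(0.25, "very low"), (0.5, "low"), (0.75, "high"), (1.0, "very high")]


def score_category(score):
    """Map a global confidence score in [0, 1] to its category name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    for upper, name in CATEGORIES:
        if score <= upper:
            return name


def gdb_per_category(cv_predictions):
    """Estimate the TDR of each score category for one GO term.

    cv_predictions is a hypothetical input: an iterable of
    (score, is_correct) pairs obtained on already characterized genes.
    Returns {category: proportion of correct predictions}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for score, is_correct in cv_predictions:
        category = score_category(score)
        totals[category] += 1
        hits[category] += int(is_correct)
    return {category: hits[category] / totals[category] for category in totals}
```

Under these assumptions, the GDB of a new prediction is looked up as `gdb_per_category(cv)[score_category(score)]`. Categories that receive no CV prediction are simply absent from the result.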

Let us now detail how the global confidence score is computed. Let d₁...d_n denote the data sources that support the prediction of g in t, and d_n+1...d_m those that do not. The global confidence score is a rough estimate of the probability that the prediction is correct, given that it is supported by d₁...d_n but not by d_n+1...d_m. Using Bayes' theorem, this probability can be written as

P(t | d₁,...,d_n, ¬d_n+1,...,¬d_m ) = (P(d₁,...,d_n, ¬d_n+1,...,¬d_m |t) × P(t))/(P(d₁,...,d_n, ¬d_n+1,...,¬d_m))

P(t) is the prior probability of term t. We use the conditional independence assumption [5] to estimate the other terms, that is:

P(d₁,...,d_n, ¬d_n+1,...,¬d_m | t) ~ P(d₁ | t)×...×P(d_n | t) ×P(¬d_n+1 | t)×...×P(¬d_m | t),

P(d₁,...,d_n, ¬d_n+1,...,¬d_m | ¬t) ~ P(d₁ | ¬t)×...×P(d_n | ¬t) ×P(¬d_n+1 | ¬t)×...×P(¬d_m | ¬t),

and

P(d₁,...,d_n, ¬d_n+1,...,¬d_m ) = P(d₁,...,d_n, ¬d_n+1,...,¬d_m | t) P(t) + P(d₁,...,d_n, ¬d_n+1,...,¬d_m | ¬t) P(¬t).
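As an illustration, the equations above can be combined into a short Python function. This is a sketch under stated assumptions rather than the authors' implementation: the function name and the argument layout (pairs of conditional probabilities per source) are hypothetical.

```python
from math import prod


def global_confidence(prior, supporting, non_supporting):
    """Naive Bayes estimate of P(t | d_1..d_n, not d_n+1..not d_m).

    prior          : P(t), the prior probability of the GO term
    supporting     : list of (P(d_i | t), P(d_i | not t)) pairs,
                     one per source that predicts t
    non_supporting : list of (P(not d_j | t), P(not d_j | not t)) pairs,
                     one per source that does not predict t
    """
    # Conditional independence: the joint likelihood is a product of
    # per-source terms, given t and given not t.
    like_t = prod(p for p, _ in supporting) * prod(p for p, _ in non_supporting)
    like_not_t = prod(q for _, q in supporting) * prod(q for _, q in non_supporting)

    numerator = like_t * prior
    # Law of total probability for the denominator of Bayes' theorem.
    evidence = numerator + like_not_t * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("the observed evidence has zero estimated probability")
    return numerator / evidence
```

For instance, with a prior of 0.1, one supporting source with (0.6, 0.05) and one non-supporting source with (0.7, 0.9), the numerator is 0.042 and the denominator 0.0825, giving a score of about 0.51.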

The terms P(d_i), P(¬d_i), P(d_i|t), P(¬d_i|t), P(d_i|¬t), and P(¬d_i|¬t) are estimated with the quantities computed in the CV. For example, P(d_i|t) is estimated by the ratio p_a/(p_a+n_a), P(¬d_i|¬t) by n_n/(p_n+n_n), and P(d_i) by (p_a+p_n)/(n_a+n_n+p_a+p_n). Thus, from the equations above, the conditional probability of t can be estimated, and it constitutes our global confidence score. Because of the (strong) independence assumption, this score cannot actually be interpreted as the probability of t. But it forms the basis of the "naive Bayes" predictor, which was shown to be fairly accurate in a number of applications [5]. When the predictor (conditional probability) is above a given threshold, the gene is predicted in t; when it is below, the gene is not predicted in t. Here we use three thresholds (0.25, 0.5 and 0.75) and estimate by CV the TDR within each interval. This TDR is the GDB of t for the studied gene.
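The estimation of these terms from the CV counts can be sketched as follows. The reading of the four counts is inferred from the ratios given above and is an assumption: p_a and n_a are taken to be the genes annotated with t that the source does and does not predict, and p_n and n_n the genes not annotated with t that the source does and does not predict. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CVCounts:
    """CV counts for one source and one GO term (assumed meaning of symbols)."""
    p_a: int  # annotated with t, predicted by the source
    n_a: int  # annotated with t, not predicted
    p_n: int  # not annotated with t, predicted
    n_n: int  # not annotated with t, not predicted


def estimate_terms(c):
    """Return the frequency estimates of the six terms as a dict.

    Assumes both the annotated and non-annotated groups are non-empty;
    otherwise a ZeroDivisionError is raised.
    """
    annotated = c.p_a + c.n_a
    not_annotated = c.p_n + c.n_n
    total = annotated + not_annotated
    return {
        "P(d|t)": c.p_a / annotated,
        "P(not d|t)": c.n_a / annotated,
        "P(d|not t)": c.p_n / not_annotated,
        "P(not d|not t)": c.n_n / not_annotated,
        "P(d)": (c.p_a + c.p_n) / total,
        "P(not d)": (c.n_a + c.n_n) / total,
    }
```

With the illustrative counts `CVCounts(p_a=30, n_a=20, p_n=10, n_n=940)`, the sketch gives P(d|t) = 30/50, P(¬d|¬t) = 940/950 and P(d) = 40/1000.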

brehelin at lirmm.fr