Cancer onset and progression are known to be regulated by genetic and epigenetic events, including RNA modifications (a.k.a. epitranscriptomics). So far, more than 150 chemical modifications have been described in all RNA subtypes, including messenger, ribosomal, and transfer RNAs. RNA modifications and their regulators are known to be implicated in all steps of post-transcriptional regulation. The dysregulation of this complex yet delicate balance can contribute to disease evolution, particularly in the context of carcinogenesis, where cells are subjected to various stresses. We sought to discover RNA modifications involved in cancer cell adaptation to inhospitable environments, a peculiar feature of cancer stem cells (CSCs). We were particularly interested in the RNA marks that help the adaptation of cancer cells to suspension culture, which is often used as a surrogate to evaluate the tumorigenic potential. For this purpose, we designed an experimental pipeline consisting of four steps: (1) cell culture in different growth conditions to favor CSC survival; (2) simultaneous RNA subtype (mRNA, rRNA, tRNA) enrichment and RNA hydrolysis; (3) the multiplex analysis of nucleosides by LC-MS/MS followed by statistical/bioinformatic analysis; and (4) the functional validation of identified RNA marks. This study demonstrates that the RNA modification landscape evolves along with the cancer cell phenotype under growth constraints. Remarkably, we discovered a short epitranscriptomic signature, conserved across colorectal cancer cell lines and associated with enrichment in CSCs. Functional tests confirmed the importance of selected marks in the process of adaptation to suspension culture, confirming the validity of our approach and opening up interesting prospects in the field.
Motivation: Phylogenetic placement enables phylogenetic analysis of massive collections of newly sequenced DNA, when de novo tree inference is too unreliable or inefficient. Assuming that a high-quality reference tree is available, the idea is to seek the correct placement of the new sequences in that tree. Recently, alignment-free approaches to phylogenetic placement have emerged, both to circumvent the need to align the new sequences and to avoid the calculations that typically follow the alignment step. A promising approach is based on the inference of k-mers that can be potentially related to the reference sequences, also called phylo-k-mers. However, its usage is limited by the time and memory-consuming stage of reference data preprocessing and the large numbers of k-mers to consider. Results: We suggest a filtering method for selecting informative phylo-k-mers based on mutual information, which can significantly improve the efficiency of placement, at the cost of a small loss in placement accuracy. This method is implemented in IPK, a new tool for computing phylo-k-mers that significantly outperforms the software previously available. We also present EPIK, a new software for phylogenetic placement, supporting filtered phylo-k-mer databases. Our experiments on real-world data show that EPIK is the fastest phylogenetic placement tool available, when placing hundreds of thousands and millions of queries while still providing accurate placements. Availability and Implementation: IPK and EPIK are freely available at https://github.com/phylo42/IPK and https://github.com/phylo42/EPIK. Both are implemented in C++ and Python and supported on Linux and MacOS. Contact: nromashchenko@lirmm.fr or rivals@lirmm.fr
Gene expression is the synthesis of proteins from the information encoded on DNA. One of the two main steps of gene expression is the translation of messenger RNA (mRNA) into polypeptide sequences of amino acids. Here, by taking into account mRNA degradation, we model the motion of ribosomes along mRNA with a ballistic model where particles advance along a filament without excluded volume interactions. Unidirectional models of transport have previously been used to fit the average density of ribosomes obtained by the experimental ribo-sequencing (Ribo-seq) technique in order to obtain the kinetic rates. The degradation rate is not, however, accounted for and experimental data from different experiments are needed to have enough parameters for the fit. Here, we propose an entirely novel experimental setup and theoretical framework consisting in splitting the mRNAs into categories depending on the number of ribosomes from one to four. We solve analytically the ballistic model for a fixed number of ribosomes per mRNA, study the different regimes of degradation, and propose a criterion for the quality of the inverse fit. The proposed method provides a high sensitivity to the mRNA degradation rate. The additional equations coming from using the monosome (single ribosome) and polysome (arbitrary number) ribo-seq profiles enable us to determine all the kinetic rates in terms of the experimentally accessible mRNA degradation rate.
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Finding the correct position of new sequences within an established phylogenetic tree is an increasingly relevant problem in evolutionary bioinformatics and metagenomics. Recently, alignment-free approaches for this task have been proposed. One such approach is based on the concept of phylogenetically-informative k-mers or phylo- k-mers for short. In practice, phylo- k-mers are inferred from a set of related reference sequences and are equipped with scores expressing the probability of their appearance in different locations within the input reference phylogeny. Computing phylo- k-mers, however, represents a computational bottleneck to their applicability in real-world problems such as the phylogenetic analysis of metabarcoding reads and the detection of novel recombinant viruses. Here we consider the problem of phylo- k-mer computation: how can we efficiently find all k-mers whose probability lies above a given threshold for a given tree node? We describe and analyze algorithms for this problem, relying on branch-and-bound and divide-and-conquer techniques. We exploit the redundancy of adjacent windows of the alignment to save on computation. Besides computational complexity analyses, we provide an empirical evaluation of the relative performance of their implementations on simulated and real-world data. The divide-and-conquer algorithms are found to surpass the branch-and-bound approach, especially when many phylo- k-mers are found.
The notion of periods is key in stringology, word combinatorics, and pattern matching algorithms. A string has period p if every two letters at distance p from each other are equal. There has been a growing interest in more general models of sequences which can describe uncertainty. An important model of sequences with uncertainty are degenerate strings. A degenerate string is a string with “undetermined” symbols, which can denote arbitrary subsets of the alphabet $Σ$. Degenerate strings have been extensively used to describe uncertainty in DNA, RNA, and protein sequences using the IUPAC code (Biochemistry, 1970). In this work, we extend the work of Blanchet-Sadri et al. (2010) to obtain the following results about the combinatorial aspects of periodicity for degenerate strings: - We compare three natural generalizations of periodicity for degenerate strings, which we refer to as weak, medium and strong periodicity. We define the concept of total autocorrelations, which are quaternary vectors indicating these three notions of periodicity. - We characterize the three families of period sets, as well as the family of total autocorrelations, for each alphabet size. In particular, we prove necessary conditions period sets should satisfy and, to prove sufficiency, we show how to construct a degenerate string which gives rise to particular period sets. - For each notion of periodicity, we (asymptotically) count the number of period sets, by combining known techniques from partial words with recent results from number theory. - Moreover, we show that all families of period sets, as well as the family of total autocorrelations, form lattices under a suitably defined partial ordering. - We compute the population of weak, medium and strong period sets (i.e., the number of strings with that period set). We also compute the population of total autocorrelations.