A Novel Riboswitch Classification based on Imbalanced Sequences achieved by Machine Learning release_oaj6of62gfbppkdzwkvobkto7u

by Solomon Shiferaw Beyene, Tianyi Ling, Blagoj Ristevski, Ming Chen

Released as a post by Cold Spring Harbor Laboratory.



<jats:title>ABSTRACT</jats:title>A riboswitch is a segment of mRNA (50–250 nt in length) with two main domains: an aptamer and an expression platform. One of the main challenges in riboswitch classification is imbalanced data, a circumstance in which one group in a dataset has very few records compared with the others. Such imbalance leads a classifier to ignore the minority group and emphasize the majority class, resulting in a skewed classification. To be consistent with recent riboswitch classification work, we considered sixteen riboswitch families whose imbalanced dataset ranges from 4,826 instances (RF00174) to 39 instances (RF01051). The dataset was divided into training and test sets using a newly developed pipeline. From 5,460 <jats:italic>k</jats:italic>-mers, 156 features were selected using <jats:italic>CfsSubsetEval</jats:italic> and <jats:italic>BestFirst</jats:italic>. Statistical testing showed a significant difference between the balanced and imbalanced datasets (<jats:italic>p</jats:italic> &lt; 0.05). In addition, each algorithm showed a significant difference in sensitivity, specificity, accuracy, and macro F-score between the two groups (<jats:italic>p</jats:italic> &lt; 0.05). Several <jats:italic>k</jats:italic>-mers clustered in the heat map were found at different structural positions such as interior loops, terminal loops, and helices. They were validated to have a biological function, and some are riboswitch motifs. The analysis highlights the importance of addressing majority-class bias and overfitting. The presented results are a generalized evaluation of both the balanced and imbalanced models, which implies their ability to classify novel riboswitches. 
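The figure of 5,460 k-mers is consistent with counting every k-mer over a four-letter RNA alphabet for k = 1 to 6 (4 + 16 + 64 + 256 + 1,024 + 4,096 = 5,460). The abstract does not state the range of k or how counts were normalized, so the sketch below is an illustration under those assumptions, not the authors' pipeline. The names `all_kmers` and `kmer_frequencies` are hypothetical, and the subsequent selection of 156 features with CfsSubsetEval and BestFirst is not reproduced here.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; DNA-style T is mapped to U below


def all_kmers(max_k=6, alphabet=ALPHABET):
    """Enumerate every k-mer for k = 1..max_k (assumed range)."""
    return [
        "".join(letters)
        for k in range(1, max_k + 1)
        for letters in product(alphabet, repeat=k)
    ]


def kmer_frequencies(seq, max_k=6):
    """Map each k-mer to its relative frequency among the overlapping
    windows of the same length k in the sequence."""
    seq = seq.upper().replace("T", "U")
    features = dict.fromkeys(all_kmers(max_k), 0.0)
    for k in range(1, max_k + 1):
        windows = len(seq) - k + 1
        if windows <= 0:
            continue  # sequence shorter than k: leave these features at 0
        for i in range(windows):
            kmer = seq[i:i + k]
            if kmer in features:  # skip windows with ambiguous bases such as N
                features[kmer] += 1.0 / windows
    return features


vocabulary = all_kmers()
print(len(vocabulary))  # 5460
```

Each sequence then becomes one fixed-length vector of 5,460 values, which is the kind of input a feature-selection step can reduce to a smaller subset.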
The scientific community can use the Python source code at <jats:ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="https://github.com/Seasonsling/riboswitch">https://github.com/Seasonsling/riboswitch</jats:ext-link>, which can contribute to the development of software packages.<jats:sec><jats:title>Author Summary</jats:title>Machine learning has been applied in many ways in bioinformatics and computational biology. Its use in riboswitch classification is still limited, and existing attempts have faced challenges due to imbalanced datasets. Algorithms classify datasets containing majority and minority groups, but they tend to ignore the minority group and emphasize the majority class, consequently returning a skewed classification. We used a new pipeline, including SMOTE for balancing datasets, that classified riboswitches better and improved the performance of the selected algorithms. A statistically significant difference was observed between the balanced and imbalanced datasets in sensitivity, specificity, accuracy, and F-score, indicating that the balanced dataset is better for riboswitch classification. Biological function and motif searches of <jats:italic>k</jats:italic>-mers in riboswitch families revealed their presence in interior loops, terminal loops, and helices; some of the <jats:italic>k</jats:italic>-mers have been reported to be riboswitch motifs of aptamer domains and critical for metabolite binding. The pipeline can be used in machine learning and deep learning studies in other domains of bioinformatics and computational biology that suffer from imbalanced datasets. Finally, the scientific community can use the Python source code, the completed work, and the workflow to develop packages.</jats:sec>
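The author summary names SMOTE (Synthetic Minority Over-sampling Technique) as the balancing step. The sketch below shows the core idea in plain Python: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is a minimal illustration, not the authors' implementation; the function name `smote`, its parameters, and the toy data are assumptions made for this example. In practice a library implementation such as `imblearn.over_sampling.SMOTE` from imbalanced-learn would normally be used, and oversampling is usually applied to the training split only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (one class only).
    k: number of nearest minority neighbours considered per base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding the base itself
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class with two features (hypothetical data)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=6, k=2)
print(len(new_points))  # 6
```

For the family sizes quoted in the abstract, fully balancing the smallest family (39 instances) against the largest (4,826 instances) would require 4,787 synthetic samples; whether the authors balanced to that level is not stated in this record.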

Archived Files and Locations

application/pdf  17.7 MB
www.biorxiv.org (repository)
web.archive.org (webarchive)
Type  post
Stage   unknown
Date   2020-03-02
Catalog Record
Revision: 68bcb62e-a697-4b5b-8080-32a17922bd41