Discriminative Learning for Probabilistic Sequence Analysis release_dk32fbfxrrhfromr6i365bs6rm

by Jonas Maaskola, Universitätsbibliothek der FU Berlin

Published by Freie Universität Berlin.



This dissertation studies discriminative learning techniques for probabilistic sequence analysis, applied to the discovery of binding site patterns in nucleic acid sequences. Sets of positive and negative example sequences define contrasts that are mined for sequence motifs whose occurrence frequency varies between the sets. A discriminative motif discovery method based on hidden Markov models (HMMs) is described that allows a choice of objective functions, two of which are used for the first time for motif finding with HMMs: mutual information of condition and motif occurrence (MICO), and the Matthews correlation coefficient. We perform an extensive and systematic comparison of the motif discovery performance of our method and numerous published tools. With MICO, or with several of the other implemented objective functions, our method outperforms all other tools. MICO is also the most generally useful discriminative objective function: it is applicable to the analysis of both probabilistic and discrete binding motif models, can leverage contrasts of more than two conditions, and provides natural extensions to quantify conditional association that are used to build models of multiple motifs. The investigation concludes with several case studies comprising 30 datasets from transcriptome-scale technologies (ChIP-Seq, RIP-ChIP, and PAR-CLIP) of embryonic stem cell transcription factors and of RNA-binding proteins. The case studies demonstrate the practicality and utility of the method, and validate it by reproducing motifs of well-studied proteins. In addition, they provide novel insights by connecting previously known splicing-relevant motifs to an alternative splicing regulator. The presented motif discovery method scales to large data sizes, makes use of available repeat experiments for increased statistical power, and can utilize more complex data configurations in addition to binary contrasts.
It is implemented in the open source software Discrover (portmanteau of dis [...]
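
To make the two objective functions named in the abstract concrete, here is a minimal sketch of how they could be evaluated on a table of occurrence counts. It is not taken from Discrover: the function names, the count-table layout (rows are conditions, columns are "motif present" and "motif absent"), and the choice of bits as the unit are all assumptions made for illustration.

```python
import math

def mutual_information(table):
    """Mutual information, in bits, of a contingency table of counts.

    Assumed layout (illustrative): one row per condition, with columns
    [sequences with a motif occurrence, sequences without one].
    """
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n > 0:  # empty cells contribute nothing to the sum
                mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi

def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient of a 2x2 table (binary contrast only)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0 for a degenerate table is a common convention, assumed here.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Hypothetical counts: motif found in 80 of 100 positive and 20 of 100 negative sequences.
mi_binary = mutual_information([[80, 20], [20, 80]])
mcc_binary = matthews_cc(tp=80, fp=20, fn=20, tn=80)

# Mutual information also accepts more than two conditions (one row each).
mi_three = mutual_information([[80, 20], [50, 50], [20, 80]])
```

The sketch only scores a fixed table of counts; how occurrences are counted under an HMM and how the score is optimized are outside its scope. It does show why mutual information extends to contrasts of more than two conditions while the Matthews correlation coefficient, defined on a 2x2 table, does not.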

Archived Files and Locations

application/pdf  6.8 MB
refubium.fu-berlin.de (publisher)
web.archive.org (webarchive)
Type  article
Stage   published
Date   2015-07-15