Relaxing Wheeler Graphs for Indexing Reads

by Travis Gagie, Garance Gourdel, Giovanni Manzini, Gonzalo Navarro and Jared Simpson

Released as an article.

2019  

Abstract

As industry standards for average-coverage rates increase, DNA readsets are becoming more repetitive. The run-length compressed Burrows-Wheeler Transform (RLBWT) is the basis for several powerful algorithms and data structures designed to handle repetitive genetic datasets, but applying it directly to readsets is problematic because end-of-string symbols break up runs and, worse, the characters at the ends of the reads lack context and are thus scattered throughout the BWT. In this paper we first propose storing the readset as a Wheeler graph consisting of a set of paths, to avoid end-of-string symbols at the cost of storing nodes' in- and out-degrees. We then propose rebuilding the Wheeler graph as if each read were preceded by some imaginary context. This requires us to relax the constraint that nodes with in-degree 0 in the graph should appear first in the ordering showing that it is a Wheeler graph, and can lead to false-positive pattern matches. Nevertheless, we first describe how to support fast locating, which allows us to filter out false matches and return all true matches, in time bounded in terms of the total number of matches. More importantly, we then also show how to augment the RLBWT for the relaxed Wheeler graph such that we can tell after what point a backward search will return only false matches, and quickly return as a witness one true match if a backward search yields any.
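The effect of end-of-string symbols on runs can be explored with a small sketch. This is not the construction from the paper: it is a naive, quadratic-time BWT computed by sorting rotations, and the names bwt, count_runs and readset_text, as well as the choice of "$" as the end-of-string symbol, are assumptions made for illustration only.

```python
def bwt(text: str) -> str:
    """Naive BWT: sort all cyclic rotations and read off the last column."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Count maximal runs of equal characters (what a run-length encoding stores)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def readset_text(reads: list[str], eos: str = "$") -> str:
    """Concatenate reads, appending an end-of-string symbol to each (assumed layout)."""
    return "".join(read + eos for read in reads)


if __name__ == "__main__":
    # Hypothetical overlapping reads; real readsets are far larger.
    reads = ["GATTACA", "ATTACAG", "TTACAGA"]
    transformed = bwt(readset_text(reads))
    print(transformed, count_runs(transformed))
```

Varying the reads and comparing count_runs on the result gives a rough feel for how the "$" symbols and the context-free characters at the ends of the reads fragment the runs that an RLBWT relies on.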

Type  article
Stage   submitted
Date   2019-02-01
Version   v3
Language   en
arXiv  1809.07320v3