Relaxing Wheeler Graphs for Indexing Reads
by
Travis Gagie, Garance Gourdel, Giovanni Manzini, Gonzalo Navarro and
Jared Simpson
2019
Abstract
As industry standards for average-coverage rates increase, DNA readsets are
becoming more repetitive. The run-length compressed Burrows-Wheeler Transform
(RLBWT) is the basis for several powerful algorithms and data structures
designed to handle repetitive genetic datasets, but applying it directly to
readsets is problematic because end-of-string symbols break up runs and, worse,
the characters at the ends of the reads lack context and are thus scattered
throughout the BWT. In this paper we first propose storing the readset as a
Wheeler graph consisting of a set of paths, to avoid end-of-string symbols at
the cost of storing nodes' in- and out-degrees. We then propose rebuilding the
Wheeler graph as if each read were preceded by some imaginary context. This
requires us to relax the constraint that nodes with in-degree 0 in the graph
should appear first in the ordering showing that it is a Wheeler graph, and can
lead to false-positive pattern matches. Nevertheless, we first describe how to
support fast locating, which allows us to filter out false matches and return
all true matches, in time bounded in terms of the total number of matches. More
importantly, we then also show how to augment the RLBWT for the relaxed Wheeler
graph such that we can tell after what point a backward search will return only
false matches, and quickly return as a witness one true match if a backward
search yields any.
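As a rough illustration of the quantity the abstract is concerned with, the sketch below builds a BWT naively by sorting rotations and counts its runs, which is what an RLBWT stores. It is not the authors' construction: the function names (`bwt`, `run_length_encode`, `readset_bwt`) and the choice to terminate every read with the same `$` symbol are assumptions made only for this example, and the quadratic rotation sort is suitable for toy inputs only.

```python
from itertools import groupby
from typing import List, Tuple


def bwt(text: str) -> str:
    """Naive BWT: sort all rotations of text and take the last column.

    Assumes text ends with a sentinel such as '$' that sorts before
    every other character. Quadratic space; toy inputs only.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> List[Tuple[str, int]]:
    """Collapse s into (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def readset_bwt(reads: List[str]) -> str:
    """Hypothetical helper: terminate each read with '$', concatenate,
    and take the BWT of the result."""
    return bwt("".join(read + "$" for read in reads))


if __name__ == "__main__":
    transformed = bwt("banana$")
    print(transformed)                     # annb$aa
    print(run_length_encode(transformed))  # 5 runs

    reads = ["GATTACA", "ATTACAG", "TTACAGA"]
    readset = readset_bwt(reads)
    print(readset, len(run_length_encode(readset)))
```

Comparing the number of runs for a readset against the number for a single long string of similar content gives an informal feel for how end-of-string symbols can interrupt runs, though a toy input like this says nothing about real coverage levels.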
arXiv:1809.07320v3