Faster Approximate Pattern Matching in Compressed Repetitive Texts

by Travis Gagie and Paweł Gawrychowski and Christopher Hoobin and Simon J. Puglisi

Released as an article.

2012  

Abstract

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an O(r)-word data structure that allows us to extract any substring of length m in O(log n + m) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in O(r (min(m k, k^4 + m) + log n) + occ) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with O(z log n) rules. In this paper we give a simple O(z log n)-word data structure that takes the same time for substring extraction but only O(z min(m k, k^4 + m) + occ) time for approximate pattern matching.
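To make the quantities in the abstract concrete, the following is a minimal, uncompressed baseline sketch, not the data structure from the paper. It computes a greedy LZ77-style parse (whose phrase count plays the role of z) and reports the end positions of approximate matches of a pattern within edit distance k using the classical column-by-column dynamic program (Sellers' algorithm). The function names `lz77_phrases` and `approx_match_ends` are hypothetical, and the parse variant used here (self-referential, with a single fresh character as the fallback phrase) is an assumption; the literature uses several slightly different definitions.

```python
def lz77_phrases(s):
    """Greedy LZ77-style parse (assumed variant): each phrase is the longest
    prefix of s[i:] that also starts at an earlier position j < i (overlap
    allowed), or a single character if no such prefix exists.
    Naive cubic-time scan, for illustration only."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a fresh character forms a phrase of length 1
        phrases.append(s[i:i + step])
        i += step
    return phrases


def approx_match_ends(s, p, k):
    """Return the 1-based end positions j in s such that some substring of s
    ending at j is within edit distance k of p.
    Runs in O(len(s) * len(p)) time on the uncompressed text."""
    m = len(p)
    col = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, c in enumerate(s, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (p[i - 1] != c),  # match or substitution
                         old + 1,                      # extra text character
                         col[i - 1] + 1)               # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

On a highly repetitive input such as "abababab" the parse has far fewer phrases than characters, which is the regime where bounds stated in terms of z rather than n are attractive. The matching routine above always scans the whole text, so it serves only as a reference point for what the compressed structures compute.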

Type  article
Stage   submitted
Date   2012-09-09
Version   v3
Language   en
arXiv  1109.2930v3