Indexing huge genome sequences for solving various problems

by K Sadakane, T Shibuya

Published in Genome Informatics Series.

2001, Volume 12, pp. 175-183

Abstract

As genome sequence databases grow, indexing the sequences for fast queries becomes increasingly important. Suffix trees and suffix arrays are used for simple queries. However, they are not suitable for complicated queries over huge amounts of sequence data, because the indices are stored on disk, which has slow access speed. We propose storing the indices in memory in a compressed form, using the compressed suffix array. It stores the suffix array compactly at the cost of a theoretically small slowdown in access speed. We show experimentally that the overhead of using the compressed suffix array is reasonable in practice. We also propose an approximate string matching algorithm suited to the compressed suffix array. Furthermore, we have constructed the compressed suffix array of the whole human genome. Because its size is about 2 GB, a workstation can hold the search index for the whole data set in main memory, which will speed up the solution of various problems in genome informatics.
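For readers unfamiliar with the underlying index, the following is a minimal sketch of a plain (uncompressed) suffix array and an exact-match query on it. It is not the compressed suffix array or the approximate matching algorithm proposed in the paper. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is only practical for short strings.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of `text`, sorted
    lexicographically. Naive construction, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of `pattern` in `text`,
    found by two binary searches over the suffix array `sa`."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    genome = "GATTACAGATTACA"  # toy sequence, not real data
    sa = build_suffix_array(genome)
    print(find_occurrences(genome, sa, "ATTA"))
```

Each query inspects only O(log n) suffixes, which is why the index is fast once it fits in main memory. The plain array stores one integer per sequence position, and that space cost is what the compressed form described in the abstract is meant to reduce.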

Type: article-journal
Stage: published
Year: 2001
Language: en
PubMed: 11791236
ISSN-L:  0919-9454