TY - GEN
T1 - String searching in referentially compressed genomes
AU - Wandelt, Sebastian
AU - Leser, Ulf
PY - 2012
Y1 - 2012
N2 - Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, so far sequences always have to be decompressed prior to an analysis. There is a need for algorithms working on compressed data directly, avoiding costly decompression. Summary: In our work, we address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and only using partial decompression. Results: Our string search algorithm for referentially compressed genomes performs exact string matching for large sets of genomes faster than using an index structure, e.g. suffix trees, for each genome, especially for short queries. We think that this is an important step towards space and runtime efficient management of large biological data sets.
AB - Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, so far sequences always have to be decompressed prior to an analysis. There is a need for algorithms working on compressed data directly, avoiding costly decompression. Summary: In our work, we address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and only using partial decompression. Results: Our string search algorithm for referentially compressed genomes performs exact string matching for large sets of genomes faster than using an index structure, e.g. suffix trees, for each genome, especially for short queries. We think that this is an important step towards space and runtime efficient management of large biological data sets.
KW - Genome compression
KW - Referential compression
KW - String search
UR - https://www.scopus.com/pages/publications/84881518499
M3 - 会议稿件
AN - SCOPUS:84881518499
SN - 9789898565297
T3 - KDIR 2012 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval
SP - 95
EP - 102
BT - KDIR 2012 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval
T2 - 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012
Y2 - 4 October 2012 through 7 October 2012
ER -