Rcsi: Scalable similarity search in thousand(s) of genomes

  • Sebastian Wandelt*
  • , Johannes Starlinger
  • , Marc Bux
  • , Ulf Leser
  • *Corresponding author for this work

Research output: Contribution to journalConference articlepeer-review

Abstract

Until recently, genomics has concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a set of genomes which are within edit distance k to a given query sequence (k-approximate search). To achieve sufficient speed, current algorithms solve this problem only for one to-be-searched genome and compute only approximate solutions, i.e., they miss some kapproximate occurrences. We present RCSI, Referentially Compressed Search Index, which scales to a thousand genomes and computes the exact answer. It exploits the fact that genomes of different individuals of the same species are highly similar by first compressing the to-be-searched genomes with respect to a reference genome. Given a query, RCSI then searches the reference and all genome-specific individual differences. We propose efficient data structures for representing compressed genomes and present algorithms for scalable compression and similarity search. We evaluate our algorithms on a set of 1092 human genomes, which amount to approx. 3 TB of raw data. RCSI compresses this set by a ratio of 450:1 (26:1 including the search index) and answers similarity queries on a mid-class server in 15 ms on average even for comparably large error thresholds, thereby significantly outperforming other methods. Furthermore, we present a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem.

Original languageEnglish
Pages (from-to)1534-1545
Number of pages12
JournalProceedings of the VLDB Endowment
Volume6
Issue number13
DOIs
StatePublished - Aug 2013
Externally publishedYes
Event39th International Conference on Very Large Data Bases, VLDB 2012 - Trento, Italy
Duration: 26 Aug 201330 Aug 2013

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being
    SDG 3 Good Health and Well-being

Fingerprint

Dive into the research topics of 'Rcsi: Scalable similarity search in thousand(s) of genomes'. Together they form a unique fingerprint.

Cite this