Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Holley, Guillaume; Wittler, Roland; Stoye, Jens

Title record

Title
Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage
Authors
Holley, Guillaume ; Wittler, Roland ; Stoye, Jens
Published in
Algorithms for Molecular Biology, vol. 11, no. 1
Year of publication
2016
Language
English
Document type
Journal article
Keywords
Pan-genome; Similar genomes; Population genomics; Colored de Bruijn graph; Bloom filter; Compression; Trie; Index; Succinct data structure
ISSN
1748-7188
URN
urn:nbn:de:0070-pub-29001293
DOI
10.1186/s13015-016-0066-8

Access restriction

The document is freely available

Files

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage [pdf 1.72 mb]

Classification

Classification (DDC) → Natural sciences and mathematics → Life sciences; biology → Life sciences; biology

Abstract

Background

High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have started to sequence tens to several thousands of genomes per species, producing a large number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts all colored k-mers (strings of length k) from the sequences and stores them in vertices.
Results

In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie (BFT). The data structure allows storing and compressing a set of colored k-mers, as well as traversing the graph efficiently. We used the BFT to index and query different pan-genome datasets. Compared to another state-of-the-art data structure, the BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, the BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.
Conclusion

We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compresses and indexes shared substrings. Vertices use basic data structures for lightweight substring storage as well as Bloom filters for efficient trie and graph traversals. Experimental results show better performance than another state-of-the-art data structure.
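
For readers unfamiliar with Bloom filters, the following is a generic textbook-style sketch in Python. It is not the vertex layout or hashing scheme of the Bloom Filter Trie; the class name, the default sizes, and the use of SHA-256 with a seed prefix are assumptions chosen only for illustration. It shows the property that makes Bloom filters useful for fast membership checks: a query may return a false positive, but never a false negative.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter over strings (illustrative, not the BFT's own)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every derived bit is set: "possibly present".
        # False if any bit is unset: "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter()
bloom.add("ACGT")
present = "ACGT" in bloom  # an inserted item is always reported as present
```

The number of bits and hash functions controls the false positive rate; the values above are arbitrary.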

Availability
https://www.github.com/GuillaumeHolley/BloomFilterTrie.
