TY - GEN
T1 - Constructing similarity graphs from large-scale biological sequence collections
AU - Zola, Jaroslaw
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/11/27
Y1 - 2014/11/27
N2 - Detecting similar pairs in large biological sequence collections is one of the most commonly performed tasks in computational biology. With the advent of high-throughput sequencing technologies, the problem regained significance as data sets with millions of sequences became ubiquitous. This paper is an initial report on our parallel, distributed-memory, sketching-based approach to constructing large-scale sequence similarity graphs. We develop load balancing techniques, derived from multi-way number partitioning and work stealing, to manage computational imbalance and ensure scalability on thousands of processors. Our experimental results show that the method is efficient and can be used to analyze data sets with millions of DNA sequences within acceptable time limits.
KW - Load balancing
KW - Min-wise independent permutations
KW - Parallel computational biology
KW - Sequence similarity
UR - https://www.scopus.com/pages/publications/84918830224
U2 - 10.1109/IPDPSW.2014.63
DO - 10.1109/IPDPSW.2014.63
M3 - Conference contribution
AN - SCOPUS:84918830224
T3 - Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
SP - 500
EP - 507
BT - Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
PB - IEEE Computer Society
T2 - 28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
Y2 - 19 May 2014 through 23 May 2014
ER -