MS Thesis Defense by Abhinav Thota
Presentation Title: Efficient Replica-Exchange Across Distributed Production Infrastructure
Committee:
Time: 2:00 PM
Location: Room 331, Johnston Hall
Abstract: Replica-Exchange (RE) methods represent a class of algorithms that involve a large number of loosely coupled ensembles and are used to understand physical phenomena ranging from protein folding dynamics to binding affinity calculations. We develop a framework for RE that supports different replica pairing and coordination mechanisms and that can use a range of production cyberinfrastructure concurrently. Additionally, our framework uses a flexible pilot-job implementation, which enables effective resource allocation for multiple replicas. We characterize the performance of two RE algorithms, synchronous and asynchronous, at unprecedented scales on production distributed infrastructure. The synchronous RE algorithm is implemented only in a centralized manner, while the asynchronous RE algorithm is implemented in both centralized and decentralized manners. We evaluate the performance of the different algorithms and implementations when we scale up the number of replicas (up to 256) on a single machine and when we scale out across 2 and 4 machines.
Both the synchronous and asynchronous algorithms perform similarly when the number of replicas is small. As the number of replicas increases, however, the synchronization cost in synchronous RE increases the total time to completion. In centralized asynchronous RE, the cost of managing many replicas centrally also increases the time to completion, but not as much as in synchronous RE. Decentralized asynchronous RE scales much better with an increasing number of replicas. When scaled out across many machines, the performance of synchronous RE depends on whether the machines are homogeneous or heterogeneous: a heterogeneous infrastructure means increased synchronization costs. Asynchronous RE is less affected than synchronous RE when scaled out across many machines.
All are invited.