Today, the eXtreme Computing Research (XCR) group at Louisiana Tech University announced a breakthrough in its RAS-ware runtime for transparent job queue fault tolerance in HPC cluster environments.
Dr. Box Leangsuksun, an associate professor of computer science, explains that the XCR group's recent breakthrough combines high availability, self-configuration, and self-healing as enabling solutions. His group of graduate students, led by Anand Tikotekar, has implemented a proof-of-concept Beowulf cluster based on HA-OSCAR 1.1 and a standard HPC resource management/job queue system (e.g., PBS/TORQUE). Preliminary results suggest that MPI jobs can continue executing and that the job queue is preserved despite failures at the head node and compute nodes. The experiment runs standard MPI jobs, without any modification, under LAM/MPI 7.0. The breakthrough handles both running and queued jobs transparently, and the queue order is maintained even in a catastrophic failure. The HA-OSCAR multi-head solution can fail over and transparently recover the job queue in the event of a head-node outage.
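For readers who want a concrete picture of queue-preserving failover, the following is a minimal illustrative sketch and not HA-OSCAR's actual implementation. It assumes a simple scheme in which every queue operation on the active head node is mirrored to a standby, so the standby holds the same jobs in the same order when it takes over. The names `HeadNode`, `MirroredJobQueue`, `submit`, `finish_next`, and `failover` are hypothetical and chosen only for this example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HeadNode:
    """Hypothetical head node holding an ordered job queue."""
    name: str
    queue: deque = field(default_factory=deque)


class MirroredJobQueue:
    """Mirrors each queue operation from the active head to a standby (assumed scheme)."""

    def __init__(self, primary: HeadNode, standby: HeadNode):
        self.active = primary
        self.standby: Optional[HeadNode] = standby

    def submit(self, job_id: str) -> None:
        # Append to the active queue and, if a standby exists, to its copy too.
        self.active.queue.append(job_id)
        if self.standby is not None:
            self.standby.queue.append(job_id)

    def finish_next(self) -> str:
        # Remove the job at the front of the queue on both nodes.
        job_id = self.active.queue.popleft()
        if self.standby is not None:
            self.standby.queue.popleft()
        return job_id

    def failover(self) -> None:
        # Promote the standby; its mirrored queue already has the same order.
        if self.standby is None:
            raise RuntimeError("no standby head node available")
        self.active, self.standby = self.standby, None


def demo() -> list:
    jobs = MirroredJobQueue(HeadNode("head-a"), HeadNode("head-b"))
    for job_id in ("job1", "job2", "job3"):
        jobs.submit(job_id)
    jobs.finish_next()   # job1 completes before the outage
    jobs.failover()      # simulated head-node outage on head-a
    jobs.submit("job4")  # new submissions go to the promoted head
    return list(jobs.active.queue)
```

In this sketch, calling `demo()` returns the remaining jobs in their original submission order on the promoted node. A real deployment would also need failure detection and recovery of running MPI jobs, which this example does not attempt to model.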
"This is very exciting for us," said Leangsuksun. "This marks a major milestone in our overarching goal - toward non-stop services in HPC environment. We expect that our breakthrough technology is exactly what the community has been waiting for."
Leangsuksun continued, "Our breakthrough is also expected to be part of the next HA-OSCAR release, which will have broad impact in HPC and telecom cluster environments, especially for mission-critical applications."
The demo will be shown at SC05 in booth #218.
HA-OSCAR is an open source project. Dr. Chokchai "Box" Leangsuksun is the chief architect and project director of the HA-OSCAR research and development program at Louisiana Tech University. The project is a collaboration between the eXtreme Computing Research (XCR) group at Louisiana Tech University and the Network and Cluster Computing (NCC) group at Oak Ridge National Laboratory (ORNL). The research and development program is funded by the Office of Science, Department of Energy, under contract DE-FG02-05ER25659. More information is available at http://xcr.cenit.latech.edu/ha-oscar