The provision of HPC resources requires more than system administration. It takes a team of skilled analysts able to work with user codes and make them run more efficiently with large numbers of processors. These analysts must work closely with both the administrators and clients to ensure that jobs run smoothly and costly resources are used effectively.
High performance computing (HPC) is a field directed toward processing tasks that require significant computing resources in a relatively short time, a period of hours or days, typically. The design of the hardware is geared toward tightly-coupled parallel tasks. Clusters with low-latency interconnects and special purpose interconnect hardware in symmetric multiprocessing (SMP) systems, are the norm in these environments. The software that runs on these systems is ideally written specifically to efficiently utilize both large numbers of cores and massive memory.
A High Performance Computing environment is characterized by the almost constant high loads on the systems. As the scheduling software adds additional jobs to the system, this leads to large changes in demands on the types of resources needed by the user jobs. These new jobs can potentially make the system unstable and impact the performance of other users’ jobs. This environment requires monitoring and frequent intervention by system analysts. Every job submitted to the system can potentially change the dynamics enough to make other users’ jobs fail. In an extreme case the entire system may crash. When this happens, system analysts and administrators must investigate and take measures to prevent this from happening in the future; while at the same time bringing the system back into production in a timely fashion.
Clients may require special assistance to run their jobs at a higher than normal priority to meet academic or business deadlines. This can be accomplished, but it requires manual intervention by system administrators. Administrators need to be familiar with, and sensitive to all users’ needs. Short term priority changes can be made for special situations.
In contrast, a typical non-HPC server, for mail or web serving for example, tends to be a single application device. Once configured properly, these types os machines can run for extended periods with downtime for regular maintenance and the occasional hardware repair.
Providing an HPC service requires a team working closely together to meet user needs and keep the system running at peak efficiency. This environment requires that skilled analysts are constantly learning new tools and techniques relevant to current and future clients.