Summary
Checkpointing is a crucial feature for ensuring the reliability and fault tolerance of complex computational tasks. NVIDIA’s cuda-checkpoint utility, combined with the open-source checkpointing tool CRIU (Checkpoint/Restore in Userspace), provides a way to transparently checkpoint and restore CUDA applications on Linux. This article explains how cuda-checkpoint works, what it can do, and how to use it with CRIU to checkpoint CUDA applications.
Understanding Checkpointing
Checkpointing is a technique used to save the state of a running application at specific intervals. If the application fails, it can be recovered from the last checkpoint, which reduces the amount of computational work lost. There are different types of checkpointing, including virtual machine checkpointing, which captures an entire machine, and application-driven checkpointing, where the application implements its own save and restore logic. Transparent, per-process checkpointing, as provided by cuda-checkpoint and CRIU, sits between the two: it captures only the target process and requires no changes to the application.
CRIU: The Foundation
CRIU is an open-source checkpointing utility for Linux that can checkpoint and restore process trees. It handles various kernel-mode resources such as anonymous memory, threads, regular files, sockets, and pipes. However, CRIU lacks native support for NVIDIA GPUs, which is where cuda-checkpoint comes into play.
cuda-checkpoint: Extending CRIU’s Capabilities
The cuda-checkpoint utility is designed to work with CRIU to checkpoint and restore the CUDA state of a single Linux process. It supports display driver version 550 and higher and can toggle the CUDA state of a process between suspended and running. The transition from running to suspended is called a suspend, and the reverse is called a resume.
Suspend and Resume Operations
During suspension, CUDA driver APIs are locked, submitted CUDA work is completed, device memory is copied to the host, and all CUDA GPU resources are released. Conversely, during resumption, GPUs are re-acquired, device memory and GPU memory mappings are restored, CUDA objects are reinstated, and CUDA driver APIs are unlocked.
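Because the same --toggle flag performs both transitions, a script has to keep suspend and resume paired itself. The following is a minimal sketch of one way to do that, not part of the utility: the helper names run and with_cuda_suspended and the DRY_RUN variable are conventions invented for this example.

```shell
#!/bin/sh
# Sketch: keep suspend and resume paired around some other command.
# The same --toggle flag performs both transitions, so the script has to
# remember which state it left the process in.

# Print the command instead of executing it when DRY_RUN=1
# (a convention of this sketch, not a cuda-checkpoint feature).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# with_cuda_suspended PID COMMAND [ARGS...]
# Suspends the CUDA state of PID, runs COMMAND, then resumes,
# and returns COMMAND's exit status.
with_cuda_suspended() {
    pid=$1
    shift
    run cuda-checkpoint --toggle --pid "$pid" || return 1   # running -> suspended
    run "$@"
    status=$?
    run cuda-checkpoint --toggle --pid "$pid" || return 1   # suspended -> running
    return "$status"
}
```

The wrapped command can be anything that needs the GPU resources released while it runs. Setting DRY_RUN=1 prints the three commands in order without touching the target process.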
Checkpointing Example: The counter Application
An example application, counter, demonstrates the checkpointing process. The application increments GPU memory upon receiving a packet and replies with the updated value. Users can build this application using nvcc and observe the checkpointing and restoration processes using cuda-checkpoint and CRIU commands.
Functionality and Limitations
As of display driver version 550, the cuda-checkpoint utility is still under active development. It supports the x64 architecture and acts on a single process rather than a process tree. It does not support UVM memory, IPC memory, or GPU migration, and it waits for already-submitted CUDA work to finish before completing a checkpoint. Future driver releases are expected to address these limitations without requiring updates to the utility itself.
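The two prerequisites stated above (an x64 host and display driver 550 or newer) can be checked before attempting a checkpoint. The sketch below is one possible preflight check, assuming nvidia-smi is installed and on the PATH; the function names and the MIN_DRIVER_MAJOR variable are hypothetical and exist only in this example.

```shell
#!/bin/sh
# Preflight sketch: check the prerequisites named in the text
# (x64 host, display driver 550 or newer).
# Function names are illustrative, not part of any NVIDIA tool.

MIN_DRIVER_MAJOR=550

# Succeeds if the major component of a version string such as
# "550.54.14" is at least MIN_DRIVER_MAJOR.
driver_supported() {
    major=${1%%.*}
    case $major in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$major" -ge "$MIN_DRIVER_MAJOR" ]
}

# Succeeds if the machine name (as reported by uname -m) is x64.
arch_supported() {
    [ "$1" = "x86_64" ]
}

# Runs both checks against the current machine.
preflight() {
    arch_supported "$(uname -m)" || {
        echo "unsupported architecture: $(uname -m)" >&2
        return 1
    }
    # Query the installed driver version; use the first GPU's line.
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    driver_supported "$version" || {
        echo "driver '$version' is older than $MIN_DRIVER_MAJOR" >&2
        return 1
    }
}
```

Calling preflight before the suspend step lets a script fail early with a readable message. It does not detect UVM or IPC memory use in the target application, which still has to be ruled out separately.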
Using cuda-checkpoint with CRIU
To use cuda-checkpoint with CRIU, follow these steps:
- Suspend CUDA State: Use cuda-checkpoint to suspend the CUDA state of the target process.
- Checkpoint with CRIU: Use CRIU to checkpoint the process.
- Restore with CRIU: Use CRIU to restore the process.
- Resume CUDA State: Use cuda-checkpoint to resume the CUDA state of the restored process.
Example Commands
- Suspend CUDA State:
cuda-checkpoint --toggle --pid $PID
- Checkpoint with CRIU:
criu dump --shell-job --images-dir demo --tree $PID
- Restore with CRIU:
criu restore --shell-job --restore-detached --images-dir demo
- Resume CUDA State:
cuda-checkpoint --toggle --pid $PID
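The four commands above can be chained in a small script so that each step runs only if the previous one succeeded. This is a sketch under stated assumptions rather than an official workflow: the function names, the DRY_RUN variable, and the mkdir step for the image directory are additions made for this example, while the cuda-checkpoint and criu invocations are the ones listed above.

```shell
#!/bin/sh
# End-to-end sketch of the four steps, using only the commands shown above.
# DRY_RUN and the function names are conventions of this sketch.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 1 and 2: suspend CUDA state, then dump the process with CRIU.
# Usage: checkpoint_cuda_process PID IMAGES_DIR
checkpoint_cuda_process() {
    pid=$1
    images_dir=$2
    run mkdir -p "$images_dir" &&
    run cuda-checkpoint --toggle --pid "$pid" &&
    run criu dump --shell-job --images-dir "$images_dir" --tree "$pid"
}

# Steps 3 and 4: restore the process with CRIU, then resume CUDA state.
# Usage: restore_cuda_process PID IMAGES_DIR
restore_cuda_process() {
    pid=$1
    images_dir=$2
    run criu restore --shell-job --restore-detached --images-dir "$images_dir" &&
    run cuda-checkpoint --toggle --pid "$pid"
}
```

A typical call would be checkpoint_cuda_process "$PID" demo, followed later by restore_cuda_process "$PID" demo. CRIU normally restores the process under its original PID, which is why the same PID is passed to the resume step, and it typically needs root privileges. Setting DRY_RUN=1 prints the commands without running them.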
Table: Key Features of cuda-checkpoint
Feature | Description |
---|---|
Transparent Checkpointing | Per-process checkpointing that sits between virtual machine checkpointing and application-driven checkpointing. |
CRIU Integration | Works with CRIU to checkpoint and restore CUDA applications. |
Suspend and Resume | Toggles CUDA state between suspended and running. |
x64 Support | Currently supports x64 architecture. |
Single Process | Acts on a single process rather than a process tree. |
Limitations | Does not support UVM memory, IPC memory, or GPU migration. |
Table: Steps for Using cuda-checkpoint with CRIU
Step | Command | Description |
---|---|---|
1. Suspend CUDA State | cuda-checkpoint --toggle --pid $PID | Suspend CUDA state of the target process. |
2. Checkpoint with CRIU | criu dump --shell-job --images-dir demo --tree $PID | Checkpoint the process using CRIU. |
3. Restore with CRIU | criu restore --shell-job --restore-detached --images-dir demo | Restore the process using CRIU. |
4. Resume CUDA State | cuda-checkpoint --toggle --pid $PID | Resume CUDA state of the restored process. |
Conclusion
The cuda-checkpoint utility, combined with CRIU, provides a way to transparently checkpoint and restore CUDA applications on Linux. By understanding how cuda-checkpoint works and how to use it with CRIU, developers can add fault tolerance to complex computational tasks. Future driver releases are expected to address the current limitations and make checkpointing available to a wider range of CUDA applications.