Checkpointing CUDA Applications with CRIU

Summary

Checkpointing lets long-running computations survive failures by saving their state so that work can resume from the last saved point. NVIDIA’s cuda-checkpoint utility, combined with the open-source tool CRIU (Checkpoint/Restore in Userspace), makes it possible to transparently checkpoint and restore CUDA applications on Linux. This article explains how cuda-checkpoint works, what it currently supports, and how to use it with CRIU.

Understanding Checkpointing

Checkpointing saves the state of a running application, typically at intervals, so that it can later be restored from that point. After a failure, the application recovers from the last checkpoint instead of starting over, which reduces lost computational work. Approaches range from virtual machine checkpointing, which captures an entire machine, to application-driven checkpointing, in which the application saves its own state. Transparent, per-process checkpointing, as provided by cuda-checkpoint and CRIU, sits between the two: it captures a single process and requires no changes to the application.

CRIU: The Foundation

CRIU is an open-source checkpointing utility for Linux that can checkpoint and restore process trees. It handles kernel-mode resources such as anonymous memory, threads, regular files, sockets, and pipes. However, CRIU has no native support for NVIDIA GPUs, which is the gap cuda-checkpoint fills.

cuda-checkpoint: Extending CRIU’s Capabilities

The cuda-checkpoint utility works with CRIU to checkpoint and restore the CUDA state of a single Linux process. It requires display driver version 550 or higher and toggles the CUDA state of a process between suspended and running. The transition from running to suspended is called a suspend; the reverse is called a resume.

Suspend and Resume Operations

During suspension, CUDA driver APIs are locked, submitted CUDA work is completed, device memory is copied to the host, and all CUDA GPU resources are released. Conversely, during resumption, GPUs are re-acquired, device memory and GPU memory mappings are restored, CUDA objects are reinstated, and CUDA driver APIs are unlocked.

Checkpointing Example: The counter Application

An example application, counter, demonstrates the checkpointing process. The application increments GPU memory upon receiving a packet and replies with the updated value. Users can build this application using nvcc and observe the checkpointing and restoration processes using cuda-checkpoint and CRIU commands.
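The build-and-run step can be sketched as a pair of shell helpers. This is a minimal sketch, not the sample's official instructions: the source file name counter.cu, UDP port 10000, the use of nc, and the helper names build_counter and poke_counter are all assumptions made for illustration. Setting DRY_RUN=1 prints each command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch only: the file name, port, and UDP transport below are assumptions,
# not details confirmed by this article.

COUNTER_SRC="${COUNTER_SRC:-counter.cu}"   # assumed source file name
COUNTER_PORT="${COUNTER_PORT:-10000}"      # assumed UDP port

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Compile the sample with nvcc into an executable named "counter".
build_counter() {
    run nvcc "$COUNTER_SRC" -o counter
}

# Send one packet and print the reply (the updated counter value).
poke_counter() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "echo hello | nc -u -w 1 localhost $COUNTER_PORT"
    else
        echo hello | nc -u -w 1 localhost "$COUNTER_PORT"
    fi
}
```

After build_counter, start the program in the background with ./counter & and record its PID with PID=$! so the later cuda-checkpoint and criu commands can refer to it. If the sample behaves as described above, each poke_counter call should print a value one higher than the last.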

Functionality and Limitations

As of display driver version 550, the cuda-checkpoint utility is still under active development. It supports the x64 architecture and acts on a single process rather than a process tree. It does not support UVM memory, IPC memory, or GPU migration, and it waits for already-submitted CUDA work to finish before completing a checkpoint. Future driver releases are expected to address these limitations without requiring updates to the utility itself.
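Because these constraints are easy to violate by accident, a preflight check before attempting a checkpoint can help. The sketch below assumes nvidia-smi is on the PATH and encodes only the two requirements stated above (driver major version 550 or higher, x64 architecture); the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: preflight checks for the requirements named in the text.
# Function names are illustrative, not part of cuda-checkpoint or CRIU.

MIN_DRIVER_MAJOR=550

# Succeeds when a driver version string such as "550.54.14" meets the minimum.
driver_supported() {
    local major="${1%%.*}"             # keep everything before the first dot
    case "$major" in
        ''|*[!0-9]*) return 1 ;;       # reject empty or non-numeric input
    esac
    [ "$major" -ge "$MIN_DRIVER_MAJOR" ]
}

# Succeeds for x86_64; checks the local machine when no argument is given.
arch_supported() {
    [ "${1:-$(uname -m)}" = "x86_64" ]
}

# Query the installed driver and report whether both checks pass.
preflight() {
    local version
    version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
    if ! driver_supported "$version"; then
        echo "driver '${version:-unknown}' is older than $MIN_DRIVER_MAJOR or was not detected" >&2
        return 1
    fi
    if ! arch_supported; then
        echo "architecture $(uname -m) is not x86_64" >&2
        return 1
    fi
    echo "preflight ok: driver $version on $(uname -m)"
}
```

Calling preflight before the suspend step gives an early, readable failure. The script cannot detect UVM or IPC memory use from outside the process, so those limitations still have to be checked against the application itself.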

Using cuda-checkpoint with CRIU

To use cuda-checkpoint with CRIU, follow these steps:

1. Suspend CUDA State: Use cuda-checkpoint to suspend the CUDA state of the target process.
2. Checkpoint with CRIU: Use CRIU to checkpoint the process.
3. Restore with CRIU: Use CRIU to restore the process.
4. Resume CUDA State: Use cuda-checkpoint to resume the CUDA state of the restored process.

Example Commands

Suspend CUDA State: cuda-checkpoint --toggle --pid $PID
Checkpoint with CRIU: criu dump --shell-job --images-dir demo --tree $PID
Restore with CRIU: criu restore --shell-job --restore-detached --images-dir demo
Resume CUDA State: cuda-checkpoint --toggle --pid $PID
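The four commands can be wrapped in two shell functions so that a failure at any step stops the sequence. This is one possible arrangement, not part of either tool: the function names, the DRY_RUN switch, the mkdir -p step, and the best-effort resume after a failed dump are assumptions; the cuda-checkpoint and criu invocations are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch: the suspend/dump and restore/resume steps as two functions.
# Function names and IMAGES_DIR handling are illustrative assumptions.

IMAGES_DIR="${IMAGES_DIR:-demo}"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 1-2: suspend the CUDA state, then dump the process with CRIU.
checkpoint_cuda_process() {
    local pid="$1"
    [ -n "$pid" ] || { echo "usage: checkpoint_cuda_process PID" >&2; return 2; }
    run mkdir -p "$IMAGES_DIR" || return 1
    run cuda-checkpoint --toggle --pid "$pid" || return 1
    if ! run criu dump --shell-job --images-dir "$IMAGES_DIR" --tree "$pid"; then
        # Best effort: the process is still CUDA-suspended, so toggle it back.
        run cuda-checkpoint --toggle --pid "$pid"
        return 1
    fi
}

# Steps 3-4: restore the process with CRIU, then resume its CUDA state.
restore_cuda_process() {
    local pid="$1"
    [ -n "$pid" ] || { echo "usage: restore_cuda_process PID" >&2; return 2; }
    run criu restore --shell-job --restore-detached --images-dir "$IMAGES_DIR" || return 1
    run cuda-checkpoint --toggle --pid "$pid"
}
```

A typical sequence is checkpoint_cuda_process "$PID" followed later by restore_cuda_process "$PID". CRIU restores the process under its original PID by default, which is why the same $PID works for the final toggle. CRIU usually needs root privileges, so the functions would normally be run from a root shell.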

Table: Key Features of cuda-checkpoint

Feature	Description
Transparent Checkpointing	Per-process checkpointing that sits between virtual machine checkpointing and application-driven checkpointing.
CRIU Integration	Works with CRIU to checkpoint and restore CUDA applications.
Suspend and Resume	Toggles CUDA state between suspended and running.
x64 Support	Currently supports x64 architecture.
Single Process	Acts on a single process rather than a process tree.
Limitations	Does not support UVM memory, IPC memory, or GPU migration.

Table: Steps for Using cuda-checkpoint with CRIU

Step	Command	Description
1. Suspend CUDA State	`cuda-checkpoint --toggle --pid $PID`	Suspend CUDA state of the target process.
2. Checkpoint with CRIU	`criu dump --shell-job --images-dir demo --tree $PID`	Checkpoint the process using CRIU.
3. Restore with CRIU	`criu restore --shell-job --restore-detached --images-dir demo`	Restore the process using CRIU.
4. Resume CUDA State	`cuda-checkpoint --toggle --pid $PID`	Resume CUDA state of the restored process.

Conclusion

Combined with CRIU, the cuda-checkpoint utility makes it possible to transparently checkpoint and restore CUDA applications on Linux. Developers who understand the suspend and resume model and the four-step workflow can add fault tolerance to long-running CUDA workloads. Future driver releases are expected to lift the current limitations and extend what can be checkpointed.
