Description
Describe the bug
When the Async IO Engine is used for the rootfs drive and there is heavy I/O at the time of pause and snapshot creation, some operation completions (read/write completions) may still be pending between the Firecracker pause and the snapshot (pending ops). When the VM is resumed later, the guest kernel freezes: it is waiting for I/O that never finishes.
This issue doesn't seem to happen when using the Sync IO Engine.
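For intuition, here is a toy Python model (not Firecracker code; all names are invented, and a thread pool stands in for io_uring) of why pausing with in-flight async I/O is risky: operations submitted before the pause can complete after it, so a snapshot taken at the pause point records a guest that is still waiting on completions the saved state may never deliver.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyAsyncEngine:
    """Stand-in for an async I/O engine: submissions complete on
    background threads some time after they are issued."""

    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.in_flight = []

    def submit(self, op_id):
        self.in_flight.append(self.pool.submit(self._do_io, op_id))

    def _do_io(self, op_id):
        time.sleep(0.01)  # simulated device latency
        return op_id

    def pending_ops(self):
        return sum(1 for f in self.in_flight if not f.done())

    def drain(self):
        # What a safe pause must guarantee: every submitted op has
        # completed and its completion has been delivered before the
        # device state is saved.
        for f in self.in_flight:
            f.result()


engine = ToyAsyncEngine()
for i in range(32):
    engine.submit(i)

# "Pause" immediately: some ops are usually still in flight, i.e. a
# snapshot taken here would capture a guest waiting on completions.
pending_at_pause = engine.pending_ops()
print(f"DRAIN: pending_ops={pending_at_pause}")

engine.drain()
# After draining, there is nothing outstanding to lose in a snapshot.
print(f"after drain: pending_ops={engine.pending_ops()}")
```

This only models the invariant (drain before saving state), not how Firecracker's io_uring engine actually handles pause.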
To Reproduce
Here is a test case that reproduces the issue most of the time with added debug messages for the pending ops: https://github.com/e2b-dev/firecracker/pull/6/files#diff-d960bea365831acfb0eb3b1b548e6d22293710c9ed558f5dfbf68e016457870dR595
Example error output:
Starting iteration 1/100 - Testing for non-zero async I/O drain
================================================================================
Free space on sandbox start: 18G
DRAIN: pending_ops=17
Restoring from snapshot...
(frozen, nothing happens after)
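For reference, the pause/snapshot/restore cycle the test exercises maps onto Firecracker's HTTP API over its unix socket. Below is a minimal sketch of that request sequence; the socket path and file paths are assumptions, and in a real restore the `/snapshot/load` call goes to a freshly started Firecracker process, not the original one.

```python
import json
import os
import socket

API_SOCK = "/tmp/firecracker.socket"  # assumed API socket path


def build_request(method, path, body):
    """Serialize one Firecracker API call as a raw HTTP/1.1 request."""
    payload = json.dumps(body)
    return (
        f"{method} {path} HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload)}\r\n\r\n"
        f"{payload}"
    ).encode()


# The sequence behind "pause -> snapshot -> restore":
requests = [
    build_request("PATCH", "/vm", {"state": "Paused"}),
    build_request("PUT", "/snapshot/create", {
        "snapshot_type": "Full",
        "snapshot_path": "/tmp/snapshot_file",  # assumed path
        "mem_file_path": "/tmp/mem_file",       # assumed path
    }),
    # Sent to a *new* Firecracker process's socket in practice:
    build_request("PUT", "/snapshot/load", {
        "snapshot_path": "/tmp/snapshot_file",
        "mem_backend": {"backend_type": "File",
                        "backend_path": "/tmp/mem_file"},
        "resume_vm": True,
    }),
]

if os.path.exists(API_SOCK):  # only send if a VMM is actually listening
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(API_SOCK)
        for req in requests:
            s.sendall(req)
            print(s.recv(4096).decode(errors="replace"))
```

With the Async engine, it is the last step that hangs in the reproduction above: the restored guest blocks on I/O that was in flight at pause time.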
Expected behavior
The resume should succeed even when there are pending ops during the Firecracker pause/snapshot.
Environment
- Firecracker version: 1.13.1
- Host and guest kernel versions: Guest kernels provided by the test suite; Host: Linux codespaces-2de7d3 6.8.0-1030-azure #35~22.04.1-Ubuntu SMP Mon May 26 18:08:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- Rootfs used: Provided by the test suite (size extended to fit the operations)
- Architecture: x86_64
- Any other relevant software versions: -
Additional context
How has this bug affected you? The resume is occasionally failing.
What are you trying to achieve? Resume a VM that has been paused previously.
Do you have any idea of what the solution might be? Not yet. My guess would be that the completions are not properly acknowledged to the guest OS.
Checks
- Have you searched the Firecracker Issues database for similar problems?
- Have you read the existing relevant Firecracker documentation?
- Are you certain the bug being reported is a Firecracker issue? Not fully sure. It might be related to the io_uring.