@gurasinghMS
Contributor

Clean cherry pick of PR #2658

When the `AdminAerHandler` enters a failed state, it stops responding to `get_next_aen` requests. As a result, the driver's `handle_asynchronous_events` loop gets stuck: no error is surfaced, the function never fails, and we lose both the original failure context and any indication that AER processing has halted.
This PR introduces clearer failure semantics for the `AdminAerHandler`. When the handler fails, it now records the most recent error and returns that error for any in-flight or future AEN requests. This ensures that the `handle_asynchronous_events` loop receives a definitive failure signal instead of waiting indefinitely.
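To illustrate the semantics described above, here is a minimal sketch. It is not the actual `AdminAerHandler` code: the type and field names (`AerHandlerSketch`, `Aen`, `AerFailure`, `waiters`) are hypothetical, a standard-library channel stands in for the driver's RPC mechanism, and delivering a successful completion to every waiter is an assumption made for brevity.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for an asynchronous event notification payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Aen(pub u32);

/// Hypothetical stand-in for the failure status the handler records.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AerFailure(pub u16);

type AenResult = Result<Aen, AerFailure>;

/// Illustrative handler: once failed, it answers every request with the recorded error.
#[derive(Default)]
pub struct AerHandlerSketch {
    /// Most recent failure. `Some` means the handler is in the failed state.
    failed: Option<AerFailure>,
    /// Requests still waiting for the next event.
    waiters: Vec<Sender<AenResult>>,
}

impl AerHandlerSketch {
    /// Register a request for the next AEN; the reply arrives on the returned receiver.
    pub fn get_next_aen(&mut self) -> Receiver<AenResult> {
        let (tx, rx) = channel();
        match self.failed {
            // Failed state: reply immediately instead of leaving the caller waiting.
            Some(err) => {
                let _ = tx.send(Err(err));
            }
            None => self.waiters.push(tx),
        }
        rx
    }

    /// Feed one AER completion (successful or failed) into the handler.
    pub fn on_completion(&mut self, completion: AenResult) {
        if let Err(err) = completion {
            // Remember the failure so that future requests also see it.
            self.failed = Some(err);
        }
        // Every in-flight request gets an answer, whether success or failure.
        for tx in self.waiters.drain(..) {
            let _ = tx.send(completion);
        }
    }
}
```

In this sketch, a request made before the failure and a request made after it both receive the same recorded error, so no caller is left waiting on a handler that will never reply.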
With this change, it becomes the responsibility of the `handle_asynchronous_events` loop to avoid repeatedly issuing AEN requests once a failure is reported. Longer term, this structure sets us up for a more robust AER restart strategy, where the handler may fail but can be restarted under specific conditions, with the async event task deciding when a restart is appropriate.
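The caller-side responsibility can be sketched roughly as follows. This is an illustration under assumptions, not the driver's `handle_asynchronous_events` implementation: the real loop is asynchronous, whereas here a plain closure (`next_aen`, hypothetical) stands in for awaiting the next AEN, and `LoopOutcome` is an invented type used only to make the result observable.

```rust
/// Hypothetical outcome of the loop: events handled, plus the error that ended it.
#[derive(Debug, PartialEq, Eq)]
pub struct LoopOutcome<E> {
    pub handled: Vec<u32>,
    pub stopped_by: E,
}

/// Illustrative event loop: keep requesting events until the source reports a
/// failure, then return that error instead of issuing another request.
pub fn run_event_loop_sketch<E>(
    mut next_aen: impl FnMut() -> Result<u32, E>,
) -> LoopOutcome<E> {
    let mut handled = Vec::new();
    loop {
        match next_aen() {
            // Normal case: process the event and ask for the next one.
            Ok(event) => handled.push(event),
            // Failure reported: stop here so no further AEN requests are issued.
            Err(err) => {
                return LoopOutcome {
                    handled,
                    stopped_by: err,
                }
            }
        }
    }
}
```

Returning the error to the caller, rather than retrying inside the loop, leaves the restart decision to the task that owns the loop, which matches the longer-term direction described above.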


… status messages when in a failed state (microsoft#2658)


Current:

![Requestaen1](https://github.com/user-attachments/assets/ce7c9fe2-2564-47f3-9107-eb0a9f92090c)

Updated:

![Requestaen2](https://github.com/user-attachments/assets/5d6f44bc-0eba-4a27-a7e1-c1746038b566)

(cherry picked from commit e23b493)
@gurasinghMS gurasinghMS requested a review from a team as a code owner January 23, 2026 22:44
Copilot AI review requested due to automatic review settings January 23, 2026 22:44
@github-actions github-actions bot added the release_1.7.2511 Targets the release/1.7.2511 branch. label Jan 23, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a critical issue in the NVMe driver's Asynchronous Event Request (AER) handler, where a failed completion caused the handler to silently stop responding and left the `handle_asynchronous_events` loop waiting indefinitely. The fix introduces proper failure semantics: when the `AdminAerHandler` encounters a failed completion, it now records the error status and returns that error for any in-flight or future AEN requests, allowing the driver's async event loop to detect the failure and exit gracefully.

Changes:

  • Modified AdminAerHandler to track failure status and propagate errors to waiting AEN requests
  • Updated RPC signatures throughout the AER handling chain to support error results
  • Added a comprehensive unit test to verify correct failure handling behavior

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `vm/devices/storage/disk_nvme/nvme_driver/src/tests.rs` | Adds unit test verifying that failed AER completions are properly reported and prevent further AER issuance |
| `vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs` | Updates AER handler to store failure status, propagate errors through RPC channels, and stop issuing AERs after failures |

@mattkur mattkur merged commit 61d5e8a into microsoft:release/1.7.2511 Jan 24, 2026
52 checks passed