nvme_driver: admin aer handler should respond with proper failure and status messages when in a failed state (#2658) #2688

gurasinghMS · 2026-01-23T22:44:27Z

Clean cherry pick of PR #2658

When the AdminAerHandler enters a failed state, it stops responding to get_next_aen requests. As a result, the driver’s handle_asynchronous_events loop becomes stuck: no errors are surfaced, the function never fails, and we lose both the original failure context and any indication that AER processing has halted.
This PR introduces clearer failure semantics for the AdminAerHandler. When the handler fails, it now records the most recent error and returns that error for any in‑flight or future AEN requests. This ensures that the handle_asynchronous_events loop receives a definitive failure signal instead of waiting indefinitely.
With this change, it becomes the responsibility of the handle_asynchronous_events loop to avoid repeatedly issuing AEN requests once a failure is reported. Longer term, this structure sets us up for a more robust AER‑restart strategy, where the handler may fail but can be restarted under specific conditions, with the async event task deciding when a restart is appropriate.
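
The failure semantics described above can be sketched roughly as follows. This is a simplified, synchronous model and not the driver's actual code: `AerHandlerSketch`, `AenStatus`, `on_completion`, and `drain_events` are hypothetical names, and the real `AdminAerHandler` and `handle_asynchronous_events` are asynchronous. The sketch also assumes that a latched failure takes precedence over any events still queued.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an NVMe completion status code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AenStatus(pub u16);

/// Simplified model of a handler that latches its most recent failure.
#[derive(Default)]
pub struct AerHandlerSketch {
    /// Events that completed successfully and have not been consumed yet.
    ready: VecDeque<u32>,
    /// Most recent failure status; never cleared in this sketch.
    failed: Option<AenStatus>,
}

impl AerHandlerSketch {
    /// Record a completion: queue the event on success, latch the status on failure.
    pub fn on_completion(&mut self, result: Result<u32, AenStatus>) {
        match result {
            Ok(event) => self.ready.push_back(event),
            Err(status) => self.failed = Some(status),
        }
    }

    /// Answer a `get_next_aen`-style request. Once failed, every call
    /// returns the recorded status (assumption: failure wins over queued events).
    pub fn get_next_aen(&mut self) -> Result<Option<u32>, AenStatus> {
        if let Some(status) = self.failed {
            return Err(status);
        }
        Ok(self.ready.pop_front())
    }
}

/// Hypothetical event loop: stops issuing requests as soon as an error
/// is reported (or, in this synchronous model, when no event is queued).
pub fn drain_events(handler: &mut AerHandlerSketch) -> (Vec<u32>, Option<AenStatus>) {
    let mut seen = Vec::new();
    loop {
        match handler.get_next_aen() {
            Ok(Some(event)) => seen.push(event),
            Ok(None) => return (seen, None),
            Err(status) => return (seen, Some(status)),
        }
    }
}
```

The loop returns on the first `Err` instead of retrying, which mirrors the responsibility this PR places on the async event task.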

Current:

![Requestaen1](https://github.com/user-attachments/assets/ce7c9fe2-2564-47f3-9107-eb0a9f92090c)

Updated:

![Requestaen2](https://github.com/user-attachments/assets/5d6f44bc-0eba-4a27-a7e1-c1746038b566)

… status messages when in a failed state (microsoft#2658) (cherry picked from commit e23b493)

Copilot

Pull request overview

This PR fixes a critical issue in the NVMe driver's Asynchronous Event Request (AER) handler where failed completions would cause the handler to silently stop responding, leaving the handle_asynchronous_events loop waiting indefinitely. The fix introduces proper failure semantics: when the AdminAerHandler encounters a failed completion, it now records the error status and returns that error for any in-flight or future AEN requests, allowing the driver's async event loop to detect the failure and exit gracefully.

Changes:

- Modified AdminAerHandler to track failure status and propagate errors to waiting AEN requests
- Updated RPC signatures throughout the AER handling chain to support error results
- Added comprehensive unit test to verify correct failure handling behavior
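
As a rough illustration of what "RPC signatures that support error results" can look like, the sketch below uses a standard-library channel whose reply type is a `Result`, so that an in-flight waiter is woken with the failure rather than left pending. All names here (`PendingAen`, `AenReply`, `Status`, `request`, `complete`) are hypothetical and chosen for illustration; the driver uses its own RPC types, which are not shown in this PR text.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical NVMe status code carried in error replies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Status(pub u16);

/// Reply type that can carry failure: an event on success, a status on error.
pub type AenReply = Result<u32, Status>;

/// Handler side, holding at most one in-flight request in this sketch.
#[derive(Default)]
pub struct PendingAen {
    waiter: Option<Sender<AenReply>>,
    failure: Option<Status>,
}

impl PendingAen {
    /// Register a request. If the handler has already failed, the reply
    /// is sent immediately so the caller never waits.
    pub fn request(&mut self) -> Receiver<AenReply> {
        let (tx, rx) = channel();
        match self.failure {
            Some(status) => {
                let _ = tx.send(Err(status));
            }
            None => self.waiter = Some(tx),
        }
        rx
    }

    /// Deliver a completion: latch the status on failure, then wake the
    /// in-flight waiter (if any) with the same reply.
    pub fn complete(&mut self, reply: AenReply) {
        if let Err(status) = reply {
            self.failure = Some(status);
        }
        if let Some(tx) = self.waiter.take() {
            let _ = tx.send(reply);
        }
    }
}
```

With a plain event reply type there is no value the handler could send to signal failure, which is why widening the reply to a `Result` is the natural place for the error to travel.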

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| vm/devices/storage/disk_nvme/nvme_driver/src/tests.rs | Adds unit test verifying that failed AER completions are properly reported and prevent further AER issuance |
| vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs | Updates AER handler to store failure status, propagate errors through RPC channels, and stop issuing AERs after failures |

gurasinghMS requested a review from a team as a code owner January 23, 2026 22:44

Copilot AI review requested due to automatic review settings January 23, 2026 22:44

github-actions bot added the release_1.7.2511 label (Targets the release/1.7.2511 branch.) Jan 23, 2026

Copilot started reviewing on behalf of gurasinghMS January 23, 2026 22:44

Copilot AI reviewed Jan 23, 2026

Merge branch 'release/1.7.2511' into cherrypick/release/1.7.2511/pr-2658

ca6fb86

mattkur approved these changes Jan 24, 2026

mattkur merged commit 61d5e8a into microsoft:release/1.7.2511 Jan 24, 2026
52 checks passed
