Add best-effort cleanup to EmrCreateJobFlowOperator on post-creation failure #61010
+126
−36
Description
Added best-effort cleanup to EmrCreateJobFlowOperator to terminate EMR clusters when failures occur after successful cluster creation.
In certain failure modes, the operator could previously create a cluster via create_job_flow and then fail during later execution steps (for example, while waiting for completion when DescribeCluster permissions are missing). In these cases, the task failed while leaving the cluster running. The operator now attempts to terminate the created job flow if an exception is raised after creation. Cleanup is best-effort and does not override or mask the original exception.
This change applies the same failure-handling approach recently introduced for EC2CreateInstanceOperator in PR #60904.
Rationale
EmrCreateJobFlowOperator is responsible for provisioning and coordinating an external, stateful service whose lifecycle extends beyond task execution. If the task fails after cluster creation, Airflow can no longer reliably manage or observe the cluster’s state. Adding opportunistic cleanup in these scenarios reduces the risk of orphaned EMR clusters and unexpected infrastructure costs, while preserving existing failure semantics. Cleanup errors are logged and do not affect the task’s final failure state.
Tests
Backwards Compatibility
No changes to the public API or operator parameters.
Reproducibility
The failure scenario could not be reproduced directly because of the permissions on my personal AWS account. However, based on the current control flow of EmrCreateJobFlowOperator, it is possible for cluster creation to succeed while a later step fails, leaving the EMR cluster running without cleanup. This change defensively addresses that case. Contributors reading this PR are welcome to provide a reproduction of this failure mode if they can.
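For readers without suitable AWS access, the control flow described in this PR can be simulated locally. The following is a minimal sketch, not the code in this PR: FakeEmrClient, its method names, the cluster id, and create_job_flow_with_cleanup are all hypothetical stand-ins chosen for illustration, and the fake does not reproduce the Airflow hook or boto3 API. It shows creation succeeding, a later step failing with a permissions error, a best-effort termination attempt, and the original exception propagating unchanged.

```python
import logging

log = logging.getLogger(__name__)


class FakeEmrClient:
    """Hypothetical stand-in for an EMR client (illustration only)."""

    def __init__(self):
        self.terminated = []

    def create_job_flow(self, config):
        # Creation succeeds and returns an id for the new cluster.
        return {"JobFlowId": "j-FAKE123"}

    def describe_cluster(self, job_flow_id):
        # Simulates missing DescribeCluster permissions after creation.
        raise PermissionError("not authorized to perform DescribeCluster")

    def terminate_job_flow(self, job_flow_id):
        self.terminated.append(job_flow_id)


def create_job_flow_with_cleanup(client, config):
    """Create a cluster; if a later step fails, try to terminate it and re-raise."""
    job_flow_id = client.create_job_flow(config)["JobFlowId"]
    try:
        # Post-creation step that may fail (for example, waiting for completion).
        client.describe_cluster(job_flow_id)
    except Exception:
        try:
            client.terminate_job_flow(job_flow_id)
        except Exception:
            # Best-effort: log the cleanup failure, then fall through.
            log.exception("Failed to terminate job flow %s", job_flow_id)
        # Bare raise re-raises the original error, not the cleanup error.
        raise
    return job_flow_id


if __name__ == "__main__":
    client = FakeEmrClient()
    try:
        create_job_flow_with_cleanup(client, {"Name": "demo"})
    except PermissionError as err:
        print("task failed with:", err)
    print("terminated:", client.terminated)
```

The cleanup call sits in its own try/except so that a failure during termination is only logged; the task still fails with the error raised by the post-creation step.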