Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today’s cloud services and industrial operations. Effective root cause diagnosis calls for identifying nodes whose disrupted local mechanisms cause anomalous behavior at a target node. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root cause as nodes that meet two criteria – 1) Anomaly
root cause nodes should take on anomalous values; 2) Fix
had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by solely probing the fitted SCM at in-distribution inputs. Our theoretical analysis demonstrates that IDI’s in-distribution intervention approach outperforms other counterfactual estimation methods whenever variance of the underlying latent exogenous variables is low. Experiments on both synthetic and Petshop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines.