Robust Root Cause Diagnosis using In-Distribution Interventions

Abstract

Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today’s cloud services and industrial operations. Effective root cause diagnosis calls for identifying the nodes whose disrupted local mechanisms cause anomalous behavior at a target node. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root causes as the nodes that meet two criteria: 1) Anomaly: root cause nodes should take on anomalous values; 2) Fix: had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by probing the fitted SCM solely at in-distribution inputs. Our theoretical analysis demonstrates that IDI’s in-distribution intervention approach outperforms other counterfactual estimation methods whenever the variance of the underlying latent exogenous variables is low. Experiments on both synthetic and Petshop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines.
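The two criteria in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the three-node chain X1 -> X2 -> Y, the linear least-squares mechanisms, the z-score threshold of 3, and the choice of the training mean as the "usual" in-distribution value are all assumptions made here for illustration, and names such as `is_anomalous` and `target_after_fix` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historical data from a chain X1 -> X2 -> Y (assumed, not from the paper).
n = 2000
x1 = rng.normal(0.0, 1.0, n)
x2 = 2.0 * x1 + rng.normal(0.0, 0.1, n)   # low-variance exogenous noise
y = -1.5 * x2 + rng.normal(0.0, 0.1, n)
train = {"X1": x1, "X2": x2, "Y": y}
parents = {"X1": [], "X2": ["X1"], "Y": ["X2"]}
order = ["X1", "X2", "Y"]                  # topological order


def fit_linear(node):
    """Least-squares fit of a node on its parents, with an intercept."""
    cols = [train[p] for p in parents[node]] + [np.ones(n)]
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), train[node], rcond=None)
    return coef


# Fitted SCM: one linear mechanism per non-root node.
models = {v: fit_linear(v) for v in order if parents[v]}


def is_anomalous(node, value, thresh=3.0):
    """Anomaly check (assumed): z-score against the training marginal."""
    z = abs(value - train[node].mean()) / train[node].std()
    return z > thresh


def target_after_fix(candidate, observed, target="Y"):
    """Set the candidate to an in-distribution value (its training mean) and
    recompute only its descendants with the fitted mechanisms."""
    values = dict(observed)
    values[candidate] = float(train[candidate].mean())
    changed = {candidate}
    for node in order:
        pa = parents[node]
        if any(p in changed for p in pa):
            coef = models[node]
            values[node] = float(np.dot(coef[:-1], [values[p] for p in pa]) + coef[-1])
            changed.add(node)
    return values[target]


# An anomalous sample in which X2's mechanism is disrupted by a +8 shift.
observed = {"X1": 0.5, "X2": 2.0 * 0.5 + 8.0, "Y": -1.5 * 9.0}

root_causes = [
    v for v in ["X1", "X2"]
    if is_anomalous(v, observed[v])                          # 1) Anomaly
    and not is_anomalous("Y", target_after_fix(v, observed)) # 2) Fix
]
print(root_causes)
```

In this toy setup the fitted mechanisms are only ever evaluated at inputs close to the training data, which is the intuition behind probing the SCM in-distribution. X1 is excluded by the anomaly condition even though intervening on it would also restore Y, which is why the sketch applies both criteria together.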

Publication
International Conference on Learning Representations (ICLR)
