CADE — An Interesting Approach to Finding Anomalies in Multidimensional Data

This is a translation of my original article on habr.com.

Introduction

One way to search for anomalies in a dataset is to use the probability density function corresponding to the data as a measure of how "anomalous" points are. The intuition is simple: the lower the value of the probability density at a point, the more "anomalous" that point is. Conversely, regions with high density values correspond to sets of more "normal" points. Thus, if we know the probability density function, then, under one possible formulation of the problem, we can identify anomalies as well.

The difficulty arises in obtaining (or, in real-life situations, approximating) the probability density function, especially when working with high-dimensional data, where standard density estimation methods become computationally expensive. More sophisticated approaches can help approximate the probability density. There is an entire class of such methods, and one of them particularly impressed me with the originality of its idea. It seemed interesting enough to present in a more accessible form on Habr.

This method is called CADE (Classifier Adjusted Density Estimation), and it was first described in 2001 in paper [1], although its true potential was explored later.

CADE is a density estimation method that performs well in high-dimensional settings and in the presence of uninformative features. Importantly, for correct operation it assumes feature independence. However, according to paper [2], the method ranks points by anomalousness at the same level as, and under some conditions even better than, LOF (Local Outlier Factor), one of the popular density-based anomaly detection methods.

The core idea of CADE is to use any classification algorithm you like, trained to distinguish known data from artificially generated points. But how can this help us approximate the probability density function? Let's take a look.

How CADE Works

In short, CADE proposes the following:

Mix the original data with data sampled from an arbitrary (but known beforehand) distribution.
Train a classifier capable of distinguishing points from the two distributions, i.e., solve a binary classification problem.
Use the classifier's predictions in an analytically derived formula to recover the density function of the original data.

How does this work? Let's find out.

I will use the notation from paper [2], which served as the basis for this article, so that it will be easier for you to read the original paper if you wish.

Let $T$ denote the available data (containing a small fraction of anomalies). Let $P(X \mid T)$ be its probability density function, which we want to approximate.

The first thing we need to do is choose the aforementioned "arbitrary" distribution from which we can generate data. Let us denote it by $A$, and its density function by $P(X \mid A)$.

The authors state that uniform and normal distributions work well for this purpose, although there are no restrictions on the choice of distribution. The only desirable property is that it fully covers the original data, meaning there is a nonzero probability of generating a point close to the original observations.

As for sample size, the authors used a generated sample equal in size to $T$, although I will share some thoughts on this later. Points generated from $A$ are sometimes referred to as artificial anomalies.

The next step is to choose a classifier. We can use any model capable of estimating the probability that a point belongs to the original class $T$, namely:

$$P(C = T \mid X)$$

Suitable classifiers include Random Forest Classifier, KNN, Decision Tree Classifier, an appropriately designed neural network, and so on.

Naturally, for every point the probabilities of belonging to class $T$ and class $A$ sum to one. In other words:

$$P(C=T \mid X)=1-P(C=A \mid X)$$

Now let us derive the formula that allows us to use the classifier's prediction to approximate the original density function $P(X \mid T)$.

Applying Bayes' theorem to $P(C=T \mid X)$, we obtain:

$$P(C=T \mid X)=\frac{P(X \mid T)P(C=T)} {P(X)}=\frac{P(X \mid T)P(C=T)} {P(X \mid T)P(C=T)+P(X \mid A)P(C=A)}$$

Solving this equation for $P(X \mid T)$ yields:

$$P(X \mid T)=\frac{P(C=A)}{P(C=T)} P(X \mid A) \frac{P(C=T \mid X)} {1-P(C=T \mid X)}$$

This is the CADE approximation of the target density function. It consists of three main factors.

1. The class-prior ratio

This can be approximated by the ratio of the number of generated points to the number of original data points:

$$\frac{P(C=A)}{P(C=T)}$$

2. The known density of the generated data

$$P(X \mid A)$$

3. The ratio computed from classifier predictions

$$\frac{P(C=T \mid X)} {1-P(C=T \mid X)}$$

If $P(X \mid A)$ is constant, meaning that $A$ was chosen as a uniform distribution, then ranking points by density values will coincide with ranking them by classifier scores (multiplication by a positive constant does not affect the ordering).

If, on the other hand, we happen to choose $P(X \mid A)$ close to $P(X \mid T)$, then the classifier will be unable to distinguish between the distributions and will predict a value of 0.5 for all points. In that case, the ranking will naturally coincide with the ranking induced by $P(X \mid A)$.

Let's leave the formulas behind and look at a small animation illustrating the CADE process without computations or the final formula.

As mentioned above, the researchers in [2] demonstrated that the method performs well on high-dimensional data and in the presence of noisy attributes. They also showed that incorporating the factor $P(X \mid A)$, which is the distinguishing feature of CADE, improves performance when features are correlated.

If you are interested, I recommend reading the results presented in that study. You will also find ROC AUC values reported for different datasets and classifiers.

Using CADE in Practice

At the time of writing, I was unable to find a Python implementation of CADE (not to be confused with another CADE—Compass-aligned Distributional Embeddings—which does have a PyPI implementation). Therefore, in this article, as well as on GitHub (cade-outliers), I provide example code implementing anomaly detection using this method.

Suggestions for improving the code are welcome.

The CADEOutliers class is responsible for approximating the density function of the original data and returning points ranked by density values.

import numpy as np


class CADEOutliers():
    """Class for outliers detection with CADE (Classifier Adjusted Density Estimation)"""
    
    _supported_A_dist = ["uniform"]
    
    def __init__(self, classifier, A_dist, A_size):
        """Create CADE outlier detection object.
        
        Parameters
        ----------
        classifier : model
            sklearn classifier model supporting fit and predict_proba methods.
        
        A_dist : string
            Distribution for generating artificial anomalies.
        
        A_size : float | int
            If float, then the number of artifical anomalies is |X| * A_size, where X - analyzed data. 
            If int, then the number of artifical anomalies is just A_size.
        """
        # Set classifier
        self.classifier = classifier
        
        # Set the distribution kind
        if A_dist not in self._supported_A_dist:
            raise ValueError(f"{A_dist} distribution is not supported yet for generating artifical anomalies. Choose from the following list: {self._supported_A_dist}")
        self.A_dist = A_dist
        
        # Set the number of artifical anomalies to generate
        if isinstance(A_size, int) and A_size >= 1:
            self.sample_size = A_size
        elif isinstance(A_size, float) and A_size >= 0:
            self.sample_size = None
        else:
            raise ValueError("A_size must be either int[1, inf) of float[0, inf)")
        self.A_size = A_size
        
    def _generate_A(self, dist, attrs):
        """Generate artifical anomalies.
        
        Parameters
        ----------
        dist : string
            Distribution for generating artificial anomalies.
        attrs : dict
            Key-value arguments for generating artifical anomalies. 
            If dist is "uniform", attrs = {'low': array, 'high': array, 'size': (sample_size, n_dim)} with min and high thresholds for each dimension.
        """
        if dist == 'uniform':
            generator = np.random.uniform
        else:
            raise ValueError(f"{dist} distribution is not supported yet for generating artifical anomalies. Choose from the following list: {self._supported_A_dist}")
        
        A = generator(**attrs)
        
        return A
    
    def fit(self, X):
        """Fit the distribution of X.
        
        Parameters
        ----------
        X : np.ndarray
            Array with samples
        """
        if self.sample_size is None:
            sample_size = int(len(X) * self.A_size) + 1
        else:
            sample_size = int(self.sample_size)
        
        if self.A_dist == 'uniform':
            n_dim = X.shape[1]
            attrs = {
                'low': X.min(axis=0),
                'high': X.max(axis=0),
                'size': (sample_size, n_dim)
            }
            # calculate uniform probability density
            P_A = 1
            for s in X.max(axis=0) - X.min(axis=0):
                if s == 0:
                    s = 1
                P_A /= s
            self.P_A = lambda x: np.array([P_A] * x.shape[0])
        else:
            attrs = None
            
        A = self._generate_A(self.A_dist, attrs)
        
        combined_data = np.vstack([X, A])
        target = np.hstack([np.zeros(X.shape[0]), np.ones(A.shape[0])])
        
        self.classifier.fit(combined_data, target)
        self.A_X_ratio = A.shape[0] / X.shape[0]
        
        
    def get_ranking(self, X):
        """Return rankings for each sample in X.
        Higher values indicate higher certainty that the point is an outlier.
        
        Parameters
        ----------
        X : np.ndarray
            Array with samples
        """
        predictions = self.classifier.predict_proba(X)[:, 0]
        
        anomaly_score = self.P_A(X) * self.A_X_ratio * (predictions / (1 - predictions + 1e-5))
        
        return anomaly_score

Generating data and applying CADE:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import CADEOutliers


# Generate data
# X consists of a mix of 3 normal distributions
# The 3rd is a small portion of anomalies
X = np.vstack([2 * np.random.randn(5000, 2), 7 + np.random.randn(3000, 2), 25 + np.random.randn(100, 2)])
y = np.hstack([np.zeros(8000), np.ones(100)])

print('The shape of X: {}'.format(X.shape))

# Split the data into train and test part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Create CADE outliers object with uniform distribution of artifical anomalies and with size of 100% of the given dataset
cade = CADEOutliers(
    classifier=RandomForestClassifier(max_depth=3), 
    A_dist='uniform', 
    A_size=1.
)

# Fit probability density of X using X_train
cade.fit(X_train)

# Get rankings for X_test. Lower values indicate higher similarity to outliers
ranking_by_dens = cade.get_ranking(X_test)

print('ROC AUC of ranking anomalies by CADE on test data: {}'.format(roc_auc_score(y_test, -ranking_by_dens)))

fig, ax = plt.subplots(1, 1, figsize=(7, 7))
sns.scatterplot(x=X_test[:, 0], y=X_test[:, 1], hue=ranking_by_dens, ax=ax, palette='magma')
plt.title('Points with rankings derived with CADE (low values indicate anomalies)')
plt.show()

A few observations that may be useful:

An overfitted classifier is more likely to consider any points seen during training as "normal" (even those that appear anomalous), while treating any regions without training data as anomalous (even if they lie close to observed points). In other words, classifier overfitting leads to a more detailed description of the probability density and greater sensitivity in anomaly detection.
A higher density of artificial anomalies filling the feature space also leads to a more detailed representation of the density function, producing essentially the same effect as classifier overfitting.
According to [2], Random Forest Classifier and KNN work well as classifiers, while a uniform distribution works well as the source of artificial anomalies.

I will be glad if you found CADE interesting as well. Especially if you decide to apply it in practice and share your observations about how it performs. Feel free to share your thoughts in the comments!

References

[1] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.

[2] https://seppe.net/aa/papers/CADE.pdf

CADE — An Interesting Approach to Finding Anomalies in Multidimensional Data

Introduction

How CADE Works

Using CADE in Practice

References

Comments

More from this blog

Automatically Extracting Piecewise-Linear Trends from a Time Series

How and Why to Compute a Balanced Precision Score

What Is the Distribution of Sample Quantiles?

Command Palette

Introduction

How CADE Works

Using CADE in Practice

References

Comments

More from this blog