
App Security Insights

Tweaking (extended) Isolation Forests

Sep 12, 2019, 1:15:00 PM / by Yaniv Karta


Introduction

This blog post focuses on the optimization process and evaluation of a single component used for outlier detection. We implemented our own forest-inspired solution as part of our meta-learning framework. Although our usage and applied implementation are quite different, we want to share the (fun part of the) journey. To limit the scope of the post, we will leave the important data preprocessing part untouched and focus instead on the journey that led to our unique implementation. For readability, we will break this blog post into three parts. To access the full white paper on Pentoma’s Generative Adversarial Model Agnostic Networks (GAMAN) and the unique technology we created, please register below.

Background

Liu, Ting, and Zhou suggested isolation forest (iForest) as an unsupervised learning method for outlier detection. iForest has linear time complexity with a low constant and a low memory requirement, which is ideal for high-volume data sets.

Liu’s isolation forests, however, displayed a bias in the branching regions of the score maps, which are parallel to the axes. To solve it, Liu, Ting, and Zhou suggested SciForest (Split criterion isolation forest), and Hariri and Kind suggested a partial implementation of SciForest described as Extended Isolation Forests (EIF).

Liu’s SciForest Algorithm II (PathLength) and Hariri’s Extended Isolation Forest Algorithm III (PathLength) seem similar at first. However, if you look closely at the PathLength algorithm, a boundary check on the upper and lower limits is missing from Hariri’s forest.
[Algorithm II: SciForest PathLength]

[Algorithm III: EIF PathLength]
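To make the difference concrete, here is a minimal sketch of a PathLength routine that includes the boundary check. The dict-based tree nodes (keys `normal`, `intercept`, `lower`, `upper`, `left`, `right`, `size`) are an assumption for illustration, not the authors’ code:

```python
import numpy as np

def c_factor(n):
    """Average path length of an unsuccessful BST search (iForest's c(n))."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + np.euler_gamma) - 2.0 * (n - 1) / n

def path_length(x, node, depth=0):
    """PathLength with a SciForest-style boundary check (sketch)."""
    if node.get("left") is None:  # external node: add the usual c(size) term
        return depth + c_factor(node["size"])
    proj = np.dot(x, node["normal"])
    # The boundary check missing from the EIF PathLength: a projection
    # outside the [lower, upper] range seen at training time stops early,
    # penalizing unseen data with a shorter path.
    if proj < node["lower"] or proj > node["upper"]:
        return depth
    child = "left" if proj < node["intercept"] else "right"
    return path_length(x, node[child], depth + 1)
```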

We decided to evaluate the hyperplane construction and suggested a modification to the distribution from which the hyperplane normal vectors are drawn:

[Equation: modified normal vector distribution]

Using the original notebooks from Hariri et al., we compared the isolation forest with no extension to the extended version, with and without our tweak, on a multi-blob dataset where the normal distribution selection was changed. The leftmost figures show the original output, while the rightmost show the output of the forests with our tweak: the variance is visibly reduced for both trees, with and without the extension.

[Figures: score maps for the four configurations]

Examining how the splitting criterion was affected by this minor change, we decided to evaluate it further and compare this configuration with other isolation forest implementations, namely Scikit-learn’s and the authors’ R module, which we will discuss further in follow-up blogs when we compare our metrics to Liu’s.

Other variations and implementations were also reviewed. Most of them seemed to rely on SciForest, IsolationForest, or EIF, and none seemed to outperform Nicolas Goix and Alexandre Gramfort’s Scikit-learn implementation, which we chose as the baseline for our performance comparison, mostly because of its consistent results and efficient implementation.

Scikit-learn Adapter

In order to fit the classifier to the Scikit-learn framework, we created a small adapter that passes the calls through to our modified EIF implementation.
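A minimal sketch of such an adapter: the `forest_cls` constructor signature mirrors Hariri et al.’s `eif.iForest(X, ntrees=..., sample_size=..., ExtensionLevel=...)` interface, but any class of that shape works. The class and method names here are illustrative assumptions, not our production code:

```python
import numpy as np
from sklearn.base import BaseEstimator, OutlierMixin

class ForestAdapter(BaseEstimator, OutlierMixin):
    """Expose a Hariri-style EIF forest through the scikit-learn API (sketch)."""

    def __init__(self, forest_cls, n_estimators=100, sample_size=256,
                 extension_level=0):
        self.forest_cls = forest_cls
        self.n_estimators = n_estimators
        self.sample_size = sample_size
        self.extension_level = extension_level

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.forest_ = self.forest_cls(
            X, ntrees=self.n_estimators,
            sample_size=min(self.sample_size, len(X)),
            ExtensionLevel=self.extension_level)
        return self

    def score_samples(self, X):
        # scikit-learn convention: higher means more normal,
        # so negate the forest's anomaly score.
        return -self.forest_.compute_paths(X_in=np.asarray(X, dtype=float))

    def decision_function(self, X):
        # Offset so negative values flag outliers, mirroring
        # sklearn.ensemble.IsolationForest's convention.
        return self.score_samples(X) + 0.5

    def predict(self, X):
        # +1 for inliers, -1 for outliers, as in scikit-learn.
        return np.where(self.decision_function(X) < 0, -1, 1)
```

With this in place, the forest drops into Scikit pipelines and benchmark scripts unchanged.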

The images below show the comparisons between forests with extension and tweaks and without them. 

[Figures: forests with and without extension and tweak]
We can see that the behavior of the iForest with no extension level is similar to Scikit’s IsolationForest implementation. Below, we can see how the tweak and the extensions affect outlier detection performance, using the second example with the toy datasets from Scikit:

[Figure: toy dataset comparison]

We noticed the considerable time the process took while traversing the full set of trees:

Scikit-learn IsolationForest: [timing figure]

EIF: [timing figure]


Optimization


To operate on the data, we first needed to separate the part of the branching condition that is fixed per hyperplane from the part that depends on the data point being shot through it:

(x − p) · n ≤ 0

Although x · n is a variable operation, which relies on the data and needs to be recalculated for each data point, p · n can be calculated and stored per node, so we avoid recomputing it for each data point. This reduces the evaluation criterion to a single dot product x · n compared against the pre-calculated value, since:

(x − p) · n ≤ 0  ⇔  x · n ≤ p · n
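As a quick numerical sketch of this identity: for a random hyperplane (normal n, intercept point p), evaluating (x − p) · n ≤ 0 per point and comparing x · n against the cached constant p · n give identical branch decisions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 8))      # data points
n = rng.normal(size=8)               # hyperplane normal vector
p = rng.normal(size=8)               # hyperplane intercept point

# Naive: recompute (x - p) . n for every point.
naive = (X - p) @ n <= 0

# Optimized: cache the per-node constant p . n once at construction time;
# evaluation then needs only a single dot product per point.
node_constant = p @ n
optimized = X @ n <= node_constant

assert np.array_equal(naive, optimized)
```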

Benchmark  

The Data

To proceed, we decided to test Scikit’s isolation forest benchmarks against our tweaked, non-tweaked, extended, and non-extended forests over 5 of the 6 infamous outlier datasets (we excluded ‘shuttle’, since the Scikit implementation failed to converge on it). We tested different sample split strategies, starting with a 50-50 split, and found all variations of the isolation forest to be most efficient with a 10% split for training and 90% for testing on the given datasets. All datasets were shuffled, yet the results showed a certain consistency, with very small deviations per run. In addition, all datasets were only 10% of their original size (i.e., the percent10 flag was on). We wanted to avoid replacing the preprocessing: even though MCMC (Markov chain Monte Carlo) would benefit these datasets, as suggested by Liu et al., we decided to keep the original encoding scheme used in Scikit for the scope of this blog post.
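For reference, the SA and SF rows below one-hot encode their categorical columns with Scikit’s LabelBinarizer (the datasets themselves come from sklearn.datasets.fetch_kddcup99 with percent10=True). Here is a toy sketch of that encoding step on a made-up protocol column:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up stand-in for a KDDcup99-style categorical feature (protocol type).
protocol = np.array(["tcp", "udp", "icmp", "tcp", "udp"])

lb = LabelBinarizer()
one_hot = lb.fit_transform(protocol)   # one indicator column per class

print(lb.classes_)    # classes are sorted alphabetically
print(one_hot)
```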

Dataset      | Description                                  | Outlier ratio | Pre-Process    | Decision Function Histogram
http         | KDDcup99 IDS outlier detection http dataset  | 0.03762       | indicator      | [histogram]
smtp         | KDDcup99 IDS outlier detection smtp dataset  | 0.00031       | indicator      | [histogram]
SA           | KDDcup99 IDS outlier SA                      | 0.03355       | LabelBinarizer | [histogram]
SF           | KDDcup99 IDS outlier SF                      | 0.04503       | LabelBinarizer | [histogram]
Forest cover | Covertype/Forest Cover                       | 0.0096        | indicator      | [histogram]

The Metrics

The receiver operating characteristic (ROC) curve is commonly used for classifiers, plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold levels. The Area Under the Curve (AUC) is the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one:

AUC = P( score(x⁺) > score(x⁻) )
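A small sketch with made-up scores showing this rank interpretation, using scikit-learn’s roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up anomaly scores: outliers (label 1) mostly rank above inliers.
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.9])

auc = roc_auc_score(y_true, scores)

# Rank interpretation: 7 of the 8 (inlier, outlier) pairs are ordered
# correctly, so AUC = 7/8 = 0.875.
```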

The anomaly score commonly used for outlier detection:

s(x, n) = 2^( −E[h(x)] / c(n) )

where h(x) is the path length of point x, E[h(x)] is its average over the trees, and c(n) is the average path length of an unsuccessful binary search tree lookup over n samples.
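A sketch of the score, with c(n) taken from the original iForest paper:

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + np.euler_gamma) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """iForest anomaly score s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

# E[h(x)] == c(n) yields exactly 0.5; shorter paths push the score
# toward 1 (anomalous), longer paths push it below 0.5 (normal).
```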

Correlation matrix:

It examines the outlier detection accuracy of our unsupervised classifiers through the correlation coefficients of our features, employing stochastic nearest neighbors on the feature set.

To construct the correlation matrix, the correlation coefficients and the p-values of the correlations are used in conjunction, so that both the significance of the correlation and its coverage can be observed.

Pearson's correlation coefficient is the covariance of two features divided by the product of their standard deviations:

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)
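A quick sketch verifying the definition against NumPy on synthetic features:

```python
import numpy as np

def pearson(a, b):
    """Pearson's r = cov(a, b) / (sigma_a * sigma_b), sample statistics."""
    return np.cov(a, b)[0, 1] / (a.std(ddof=1) * b.std(ddof=1))

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.1, size=200)  # strongly correlated with x
z = rng.normal(size=200)                        # independent of x

# Matches NumPy's corrcoef up to floating-point error.
assert abs(pearson(x, y) - np.corrcoef(x, y)[0, 1]) < 1e-12
```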

For the first part of this post, we will focus on AUC to keep it simple. In the follow-up posts, we will show how the other metrics become more and more useful.

Testing

On a 50-50 (train/test) split, Scikit’s isolation forest scored a mean AUC of 0.906, while the non-tweaked version showed an AUC of 0.875. With the tweak, it increased to 0.9, showing accuracy similar to the Scikit implementation.



We continued and measured the same datasets with extension levels 1...N−1 for the dimensions of each dataset, observing the performance loss or gain of the algorithm over each dataset with the extensions, and the effectiveness of the tweak. Below are the results for extension levels 1...N/2 of the dimensions for each dataset. With the extended nodes, we observed a general loss and not much difference between the tweaked and non-tweaked versions.




At this point, we wanted to reduce the long testing time and optimize the solution. 

The time it took to traverse and calculate the path for each tree in the forest was largely driven by the number of dot product operations. We wanted to find the best approach to reduce the time complexity and the number of trees required to perform as well as or better than Scikit’s implementation, without resorting to an external C/C++ implementation like Nicolas Goix and Alexandre Gramfort’s Scikit-learn version or Liu, Ting, and Zhou’s R module.

Liu et al. suggested the SDgain measurement to evaluate the splitting criterion:

SDgain = ( σ(X) − avg(σ(X_l), σ(X_r)) ) / σ(X)

Liu et al. showed that this gain efficiently separates different distributions while preserving bimodality. There is a performance penalty in calculating the standard deviation over the data. By applying it only to the hyperplanes, we removed the data size factor from the calculation, as its complexity relies merely on the number of hyperplanes, or nodes, in the tree. Our gain is used to evaluate the fitness of the hyperplanes rather than the data association for each tree, and is not used to determine the split index of the sample/feature as originally intended.
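A sketch of the gain on projected values, with made-up distributions to show the bimodality behavior (our version applies this to the hyperplane projections rather than to raw split candidates):

```python
import numpy as np

def sd_gain(parent, left, right):
    """SciForest-style SDgain for a candidate split (sketch).

    Gain is high when the two children are much 'tighter' (lower std)
    than the parent projection values.
    """
    sd_parent = parent.std()
    if sd_parent == 0:
        return 0.0
    return (sd_parent - (left.std() + right.std()) / 2.0) / sd_parent

# A clean bimodal separation yields a high gain...
a = np.concatenate([np.zeros(50), np.ones(50) * 10.0])
good = sd_gain(a, a[:50], a[50:])

# ...while a split through the middle of a single uniform blob does not.
b = np.linspace(0.0, 1.0, 100)
bad = sd_gain(b, b[:50], b[50:])
```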

The last thing missing from the construction is the vertex, as described in the SciForest paper:

[Equation: vertex definition from the SciForest paper]

The vertex fixes the low and high bound values and can be used to minimize the search for the path length and/or to penalize unseen data with a shorter path length.

Re-Testing


Since the cost of tree construction is relatively low, we can automatically grow a bigger forest (Overplant) or shrink a forest (Deforestation) to better fit the data and ignore trees generated by noisy samples. The deforestation strategy can select either the trees closest to the average gain (Avg. Fit) or the trees with the best gain (Best Fit).
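A sketch of the selection step, with trees reduced to their per-tree gains; the strategy names and growth/shrink factors follow the description here, while everything else is illustrative:

```python
import numpy as np

def deforest(tree_gains, n_keep, strategy="best"):
    """Pick surviving tree indices (sketch; a tree is represented only
    by its hyperplane-fitness gain here).

    'best'    keeps the n_keep highest-gain trees;
    'avg_fit' keeps the n_keep trees closest to the mean gain.
    """
    gains = np.asarray(tree_gains, dtype=float)
    if strategy == "best":
        order = np.argsort(gains)[::-1]
    elif strategy == "avg_fit":
        order = np.argsort(np.abs(gains - gains.mean()))
    else:
        raise ValueError(strategy)
    return np.sort(order[:n_keep])

def plan_forest(n_estimators, grow=2 * np.euler_gamma, shrink=0.185):
    """Overplant by ~2*gamma (115.44%) extra trees, then keep 18.5%."""
    n_grown = int(np.ceil(n_estimators * (1.0 + grow)))
    n_kept = max(1, int(np.ceil(n_estimators * shrink)))
    return n_grown, n_kept
```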

To illustrate, let’s look at the previous example again with Avg. Fit and Best Fit, with and without our modified normal (upper left and right), as well as the same configurations with the acceptable range check (bottom left and bottom right), in comparison to Scikit:



We tested the same datasets with our modified configurations.

Deforestation :

  • Cut trees based on best gain (Best Fit) + Acceptable range check
  • Cut trees based on best gain (Best Fit) - Acceptable range check 
  • Cut trees based on avg. gain (Avg. Fit) + Acceptable range check 
  • Cut trees based on avg. gain (Avg. Fit) - Acceptable range check

We ran each configuration with a 50-50 split and a 10-90 split respectively, in an effort to validate the consistency of the results over the same datasets. The growth and shrink factors were arbitrarily selected: we overplant to twice the Euler gamma constant (115.44%) of the original forest size, and the shrinking factor was set to 18.5% of the original estimator count.

[Result histograms: Scikit-learn 10-90; Avg. Fit w/o range 10-90; Best Fit w/o range 10-90; Best Fit with acceptable range 10-90; Avg. Fit with acceptable range 10-90; Best Fit, no acceptable range, 50-50; Avg. Fit, no acceptable range, 50-50; Best Fit, no acceptable range, 50-50; Best Fit, acceptable range, 50-50]

The best configuration recorded was without the acceptable range check, with a mean AUC of 0.954 vs. Scikit’s mean AUC of 0.948 for the 10-90 split.

We also tested with the path punishment described earlier, where unseen data is penalized with a shorter path length, and measured the variation in the results.

For Best Fit with acceptable range and punishment, the mean AUC was largely not affected and the mean score recorded was 0.911. For the average gain with acceptable range and punishment, the mean AUC dropped to 0.872. Without the acceptable range check on average gain fit, the mean AUC was 0.925. Lastly, the Best Fit without the acceptable range and with the punishment lowered the AUC Score to 0.910.

Minimization

Constructing a single tree to compose the forest would naturally be computationally favorable, since it reduces the complexity by a factor of the number of trees. The strategy for merging the trees is to iterate to the deepest level and merge the nodes where the hyperplanes of the counterpart tree either do not exist in the current tree or would complement an existing branch.
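A toy sketch of that grafting rule, with trees flattened to {path: hyperplane} maps where a path such as 'LRL' addresses a node. Real nodes would carry normals, intercepts, and limits; this only illustrates the merge decision:

```python
def merge_trees(primary, other):
    """Graft hyperplanes from `other` into `primary` where they complement
    an existing branch, i.e. where `primary` has no node at that path yet."""
    merged = dict(primary)
    # Deepest nodes first, mirroring the traversal described above.
    for path, plane in sorted(other.items(), key=lambda kv: -len(kv[0])):
        if path not in merged:
            merged[path] = plane
    return merged
```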

Below are several results from minimization to 1, 3, and log2(dimensions) trees.

[Result figures: one avg. gain tree; three avg. gain trees; log2(dimensions) trees; 2·log2(dimensions) trees]

Reducing to one or three trees, as you can see above, resulted in a slight accuracy loss over the given datasets. In practice, we could use the same technique to merge the trees and filter out the noisy, inconsistent hyperplanes that would produce false positives for a given dataset.

As a final test for this part of the blog post, we turned the percent10 flag off and tested the entire datasets, to compare the results of our single pruned tree to those of Scikit with 100 estimators and with 1 estimator:


On the full dataset with a 10/90 training-testing split, we consistently acquired a mean AUC of 0.976 with a reasonable time performance. Scikit showed 0.968 with 100 estimators and 0.861 with a single tree.  

We will introduce and discuss the IsolationHammingMatrix code in the next blog post, and evaluate it on additional outlier datasets using the evaluation metrics mentioned here.


 

Additional References (not linked in the text):

One class SVM: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2099486

LOF: http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf

Model Agnostic Meta Learning: https://arxiv.org/abs/1703.03400

SeqGAN: https://arxiv.org/abs/1609.05473

Random Forests: https://link.springer.com/article/10.1023/A:1010933404324


 

Interested in accessing the white paper on Pentoma's GAMAN? Register here

Topics: Penetration Testing, GAMAN, Isolation Forest, SCIKITLearn, RForge, Fuzzing, Offensive Security, Unsupervised Learning, Pentoma, SCIForest


Written by Yaniv Karta

Yaniv Karta is the CTO of SEWORKS.