{"id":19,"date":"2021-11-24T04:09:50","date_gmt":"2021-11-24T10:09:50","guid":{"rendered":"http:\/\/amolmavuduru.blog\/2021\/11\/24\/how-to-perform-anomaly-detection-with-the-isolation-forest-algorithm\/"},"modified":"2021-11-24T04:09:50","modified_gmt":"2021-11-24T10:09:50","slug":"how-to-perform-anomaly-detection-with-the-isolation-forest-algorithm","status":"publish","type":"post","link":"https:\/\/datasciencedemystified.com\/?p=19","title":{"rendered":"How to perform anomaly detection with the Isolation Forest algorithm"},"content":{"rendered":"<h4>How you can use this tree-based algorithm to detect outliers in your\u00a0data<\/h4>\n<figure class=\"wp-caption\"><img decoding=\"async\" data-width=\"3264\" data-height=\"2448\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*UsxwS9qFOqn7n3Rvl8FxlQ.jpeg\"><figcaption class=\"wp-caption-text\">Photo by <a href=\"https:\/\/unsplash.com\/@skamenar?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener\">Steven Kamenar<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/s\/photos\/forest?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption><\/figure>\n<p>Anomaly detection is a frequently overlooked area of machine learning. It doesn\u2019t seem anywhere near as flashy as deep learning or natural language processing and often gets skipped entirely in machine learning courses.<\/p>\n<p>However, anomaly detection is still important and has applications ranging from data preprocessing to fraud detection and even system health monitoring. There are many anomaly detection algorithms but the fastest algorithm at the time of writing this article is the Isolation Forest, also known as the iForest.<\/p>\n<p><strong>In this article, I will briefly explain how the Isolation Forest algorithm works and also demonstrate how you can use this algorithm for anomaly detection in Python.<\/strong><\/p>\n<h3>How the Isolation Forest Algorithm Works<\/h3>\n<p>The Isolation Forest Algorithm takes advantage of the following properties of anomalous samples (often referred to as outliers):<\/p>\n<ul>\n<li><strong>Fewness<\/strong>\u200a\u2014\u200aanomalous samples are a minority and there will only be a few of them in any dataset.<\/li>\n<li><strong>Different\u200a<\/strong>\u2014\u200aanomalous samples have values\/attributes that are very different from those of normal samples.<\/li>\n<\/ul>\n<p>These two properties make it easier to isolate anomalous samples from the rest of the data in comparison to normal points.<\/p>\n<figure class=\"wp-caption\"><img decoding=\"async\" data-width=\"1516\" data-height=\"678\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*6GFMewU1Aax57nsW2uSakQ.png\"><figcaption class=\"wp-caption-text\">Isolating an anomalous point versus a normal point. Image by the\u00a0author.<\/figcaption><\/figure>\n<p>Notice how in the figure above, we can isolate an anomalous point from the rest of the data with just one line, while the normal point on the right requires four lines to isolate completely.<\/p>\n<h4>The Algorithm<\/h4>\n<p>Given a sample of data points <em>X<\/em>,<em> <\/em>the Isolation Forest algorithm builds an Isolation Tree (iTree), <em>T<\/em>, using the following steps.<\/p>\n<ol>\n<li><strong>Randomly select an attribute <em>q <\/em>and a split value <em>p.<\/em><\/strong><\/li>\n<li><strong>Divide <em>X <\/em>into two subsets by using the rule <em>q &lt; p<\/em>. The subsets will correspond to a left subtree and a right subtree in <em>T.<\/em><\/strong><\/li>\n<li><strong>Repeat steps 1\u20132 recursively until either the current node has only one sample or all the values at the current node have the same values.<\/strong><\/li>\n<\/ol>\n<p>The algorithm then repeats steps 1\u20133 multiple times to create several Isolation Trees, producing an Isolation Forest. Based on how Isolation Trees are produced and the properties of anomalous points, we can say that most anomalous points will be located closer to the root of the tree since they are easier to isolate when compared to normal points.<\/p>\n<figure class=\"wp-caption\"><img decoding=\"async\" data-width=\"1148\" data-height=\"694\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*ojkOKK2Ro5efuLljEgpPmw.png\"><figcaption class=\"wp-caption-text\">An example of an isolation tree created from a small dataset. Image by the\u00a0author.<\/figcaption><\/figure>\n<p>Once we have an Isolation Forest (a collection of Isolation Trees) the algorithm uses the following anomaly score given a data point <em>x<\/em> and a sample size of <em>m<\/em>:<\/p>\n<figure class=\"wp-caption\"><img decoding=\"async\" data-width=\"510\" data-height=\"166\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*4Ghr9I4MbO4CQNv9TxYV5g.png\"><figcaption class=\"wp-caption-text\">The anomaly score for the Isolation Forest algorithm. Image by the\u00a0author.<\/figcaption><\/figure>\n<p>In the equation above, <em>h(x) <\/em>represents the path length of the data point <em>x i<\/em>n a given Isolation Tree. The expression E(h(x)) represents the expected or \u201caverage\u201d value of this path length across all the Isolation Trees. The expression<em> c(m) <\/em>represents the average value of <em>h(x) <\/em>given a sample size of <em>m <\/em>and is defined using the following equation.<\/p>\n<figure class=\"wp-caption\"><img decoding=\"async\" data-width=\"826\" data-height=\"298\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*LHTgVkMMgk855Q9gRUG7cg.png\"><figcaption class=\"wp-caption-text\">The average value of h(x) for a sample size m. Image by the\u00a0author.<\/figcaption><\/figure>\n<p>The equation above is derived from the fact that an Isolation Tree has the same structure as a binary search tree. The termination of a node in an Isolation Tree is similar to an unsuccessful search in a binary search tree as far as the path length is concerned. Once the anomaly score <em>s(x, m)<\/em> is computed for a given point, we can detect anomalies using the following criteria:<\/p>\n<ol>\n<li><strong>If <em>s(x, m) <\/em>is close to 1 then <em>x<\/em> is very likely to be an anomaly.<\/strong><\/li>\n<li><strong>If <em>s(x, m) <\/em>is less than 0.5 then <em>x <\/em>is likely a normal point.<\/strong><\/li>\n<li><strong>If <em>s(x, m) <\/em>is close to 0.5 for all of the points in the dataset then the data likely does not contain any anomalies.<\/strong><\/li>\n<\/ol>\n<p>Keep in mind that the anomaly score will always be greater than zero but less than 1 for all points so it is quite similar to a probability score.<\/p>\n<h3>Using the Isolation Forest Algorithm in Scikit-learn<\/h3>\n<p>The Isolation Forest algorithm is available as a module in Scikit-learn. In this tutorial, I will demonstrate how to use Scikit-learn to perform anomaly detection with this algorithm. You can find the <a href=\"https:\/\/github.com\/AmolMavuduru\/IsolationForestTutorial\" target=\"_blank\" rel=\"noopener\">full code<\/a> for this tutorial on <a href=\"https:\/\/github.com\/AmolMavuduru\/IsolationForestTutorial\" target=\"_blank\" rel=\"noopener\">GitHub<\/a>.<\/p>\n<h4>Import Libraries<\/h4>\n<p>In the code below, I imported some commonly used libraries as well as the Isolation Forest module from Scikit-learn.<\/p>\n<pre>import numpy as np<br>import pandas as pd<br>from sklearn.ensemble import IsolationForest<br>import matplotlib.pyplot as plt<br>%matplotlib inline<\/pre>\n<h4>Building the\u00a0Dataset<\/h4>\n<p>In order to create a dataset for demonstration purposes, I made use of the <strong>make_blobs<\/strong> function from Scikit-learn to create a cluster and added some random outliers in the code below. The entire dataset contains 500 samples and of those 500 samples, only 5 percent or 25 samples are actually anomalies.<\/p>\n<pre>from sklearn.datasets import make_blobs<\/pre>\n<pre>n_samples = 500<br>outliers_fraction = 0.05<br>n_outliers = int(outliers_fraction * n_samples)<br>n_inliers = n_samples - n_outliers<\/pre>\n<pre>blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)<\/pre>\n<pre>X = make_blobs(centers=[[0, 0], [0, 0]], <br>               cluster_std=0.5,<br>               **blobs_params)[0]<\/pre>\n<pre>rng = np.random.RandomState(42)<\/pre>\n<pre>X = np.concatenate([X, rng.uniform(low=-6, high=6, size=(n_outliers, 2))], axis=0)<\/pre>\n<h4>Training the Algorithm<\/h4>\n<p>Training an Isolation Forest is easy to do with Scikit-learn\u2019s API as demonstrated in the code below. Notice that I specified the number of iTrees in the <strong>n_estimators <\/strong>argument. There is also another argument named <strong>contamination <\/strong>that we can use to specify the percentage of the data that contains anomalies. However, I decided to omit this argument and use the default value because in realistic anomaly detection situations this information will likely be unknown.<\/p>\n<pre>iForest = IsolationForest(n_estimators=20, verbose=2)<br>iForest.fit(X)<\/pre>\n<h4>Predicting Anomalies<\/h4>\n<p>Predicting anomalies is straightforward as demonstrated in the code below.<\/p>\n<pre>pred = iForest.predict(X)<\/pre>\n<p>The <strong>predict<\/strong> function will assign a value of either 1 or -1 to each sample in X. A value of 1 indicates that a point is a normal point while a value of -1 indicates that it is an anomaly.<\/p>\n<h4>Visualizing Anomalies<\/h4>\n<p>Now that we have some predicted labels for each sample in X, we can visualize the results with Matplotlib as demonstrated in the code below.<\/p>\n<pre>plt.scatter(X[:, 0], X[:, 1], c=pred, cmap='RdBu')<\/pre>\n<p>The code above produces the visualization below.<\/p>\n<figure class=\"wp-caption\"><img decoding=\"async\" data-width=\"932\" data-height=\"656\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*cadnj4r0dHnRlR5gghrsmQ.png\"><figcaption class=\"wp-caption-text\">Visualization with anomalous points in red and normal points in blue. Image by the\u00a0author.<\/figcaption><\/figure>\n<p>In the visualization below, the points that were labeled as anomalies are in red and the normal points are in blue. The algorithm seems to have done a good job at detecting anomalies, although some points at the edge of the cluster in the center may be normal points that were labeled as anomalies instead. Taking a look at the anomaly scores might help us understand the algorithm\u2019s performance a little better.<\/p>\n<h4>Visualizing Anomaly\u00a0Scores<\/h4>\n<p>We can generate simplified anomaly scores using the <strong>score_samples<\/strong> function as shown below.<\/p>\n<pre>pred_scores = -1*iForest.score_samples(X)<\/pre>\n<p>Note that the scores returned by the <strong>score_samples<\/strong> function will all be negative and correspond to the negative value of the anomaly score defined earlier in this article. For the sake of consistency, I have multiplied these scores by -1.<\/p>\n<p>We can visualize the scores assigned to each point using the code below.<\/p>\n<pre>plt.scatter(X[:, 0], X[:, 1], c=pred_scores, cmap='RdBu')<br>plt.colorbar(label='Simplified Anomaly Score')<br>plt.show()<\/pre>\n<p>The code above gives us the following useful visualization.<\/p>\n<figure class=\"wp-caption\"><img decoding=\"async\" data-width=\"920\" data-height=\"648\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*DaOHiM5QUtTMP1ao_WQmrQ.png\"><figcaption class=\"wp-caption-text\">Visualization of anomaly scores. Image by the\u00a0author.<\/figcaption><\/figure>\n<p>When looking at the visualization above, we can see that the algorithm did work as expected since the points that are closer to the blue cluster in the middle have lower anomaly scores, and the points that are further away have higher anomaly scores.<\/p>\n<h3>Summary<\/h3>\n<p>The Isolation Forest algorithm is a fast tree-based algorithm for anomaly detection. The algorithm uses the concept of path lengths in binary search trees to assign anomaly scores to each point in a dataset. Not only is the algorithm fast and efficient, but it is also widely accessible thanks to Scikit-learn\u2019s implementation.<\/p>\n<p>As usual, you can find the <a href=\"https:\/\/github.com\/AmolMavuduru\/IsolationForestTutorial\" target=\"_blank\" rel=\"noopener\">full code<\/a> for this article on <a href=\"https:\/\/github.com\/AmolMavuduru\/IsolationForestTutorial\" target=\"_blank\" rel=\"noopener\">GitHub<\/a>.<\/p>\n<h3>Join my Mailing\u00a0List<\/h3>\n<p>Do you want to get better at data science and machine learning? Do you want to stay up to date with the latest libraries, developments, and research in the data science and machine learning community?<\/p>\n<p>Join my <a href=\"https:\/\/mailchi.mp\/e8dd82679724\/amols-data-science-blog\" target=\"_blank\" rel=\"noopener\">mailing list <\/a>to get updates on my data science content. You\u2019ll also get my free <strong>Step-By-Step Guide to Solving Machine Learning Problems<\/strong> when you <a href=\"https:\/\/mailchi.mp\/e8dd82679724\/amols-data-science-blog\" target=\"_blank\" rel=\"noopener\">sign up<\/a>!<\/p>\n<p>And while you\u2019re at it, consider <a href=\"https:\/\/amolmavuduru.medium.com\/membership\" target=\"_blank\" rel=\"noopener\">joining the Medium community<\/a> to read articles from thousands of other writers as well.<\/p>\n<h3>Sources<\/h3>\n<ol>\n<li>F. T. Liu, K. M. Ting, and Z. H. Zhou, <a href=\"https:\/\/cs.nju.edu.cn\/zhouzh\/zhouzh.files\/publication\/icdm08b.pdf?q=isolation-forest\" target=\"_blank\" rel=\"noopener\">Isolation Forest<\/a>, (2008), 2008 Eighth IEEE International Conference on Data Mining.<\/li>\n<li>F. Pedregosa <em>et al, <\/em><a href=\"http:\/\/jmlr.csail.mit.edu\/papers\/v12\/pedregosa11a.html\" target=\"_blank\" rel=\"noopener\">Scikit-learn: Machine Learning in Python<\/a>, (2011), Journal of Machine Learning Research.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>How you can use this tree-based algorithm to detect outliers in your\u00a0data Photo by Steven Kamenar on\u00a0Unsplash Anomaly detection is a frequently overlooked area of machine learning. It doesn\u2019t seem anywhere near as flashy as deep learning or natural language processing and often gets skipped entirely in machine learning courses. However, anomaly detection is still [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-19","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=\/wp\/v2\/posts\/19","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=19"}],"version-history":[{"count":0,"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=\/wp\/v2\/posts\/19\/revisions"}],"wp:attachment":[{"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=19"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=19"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datasciencedemystified.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=19"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}